US COVID Data

EDA

US COVID Data

We leveraged The COVID Tracking Project to analyze the impact of COVID-19 on the United States. The COVID Tracking Project is used by many leading and trusted groups, such as the CDC, HHS, and Johns Hopkins.

Using the simple read.csv command, we load the data from the Data folder of our Term Project GitHub repository.

us <- read.csv("https://raw.githubusercontent.com/JaclynCoate/6373_Time_Series/master/TermProject/Data/USdaily7.19.csv", header = T, strip.white = T)
str(us)
## 'data.frame':    179 obs. of  25 variables:
##  $ date                    : int  20200718 20200717 20200716 20200715 20200714 20200713 20200712 20200711 20200710 20200709 ...
##  $ states                  : int  56 56 56 56 56 56 56 56 56 56 ...
##  $ positive                : int  3692061 3626881 3549648 3478695 3413313 3350434 3291969 3230991 3167984 3101339 ...
##  $ negative                : int  41273443 40576852 39802297 39048662 38351244 37653841 36990207 36323867 35751437 34994707 ...
##  $ pending                 : int  3032 3002 2929 2947 960 2610 2639 2618 2493 2530 ...
##  $ hospitalizedCurrently   : int  57562 57705 57369 56143 55533 53988 52632 51834 51577 43924 ...
##  $ hospitalizedCumulative  : int  276439 274436 271758 269543 267127 264865 263651 262712 257571 255253 ...
##  $ inIcuCurrently          : int  6396 6453 6359 6317 6235 6074 5930 5939 5896 5845 ...
##  $ inIcuCumulative         : int  12342 12243 12091 12002 11857 11749 11679 11612 11523 11370 ...
##  $ onVentilatorCurrently   : int  2343 2353 2317 2317 2263 2259 2182 2169 2197 2127 ...
##  $ onVentilatorCumulative  : int  1211 1200 1175 1166 1154 1142 1136 1128 1118 1138 ...
##  $ recovered               : int  1122720 1106844 1090645 1075882 1049098 1031939 1006326 995576 983185 969111 ...
##  $ dateChecked             : Factor w/ 179 levels "2020-01-22T00:00:00Z",..: 179 178 177 176 175 174 173 172 171 170 ...
##  $ death                   : int  132395 131523 130572 129598 128740 128004 127677 127201 126444 125590 ...
##  $ hospitalized            : int  276439 274436 271758 269543 267127 264865 263651 262712 257571 255253 ...
##  $ lastModified            : Factor w/ 179 levels "2020-01-22T00:00:00Z",..: 179 178 177 176 175 174 173 172 171 170 ...
##  $ total                   : int  44968536 44206735 43354874 42530304 41765517 41006885 40284815 39557476 38921914 38098576 ...
##  $ totalTestResults        : int  44965504 44203733 43351945 42527357 41764557 41004275 40282176 39554858 38919421 38096046 ...
##  $ posNeg                  : int  44965504 44203733 43351945 42527357 41764557 41004275 40282176 39554858 38919421 38096046 ...
##  $ deathIncrease           : int  872 951 974 858 736 327 476 757 854 867 ...
##  $ hospitalizedIncrease    : int  2003 2678 2215 2416 2262 1214 939 5141 2318 1719 ...
##  $ negativeIncrease        : int  696591 774555 753635 697418 697403 663634 666340 572430 756730 639952 ...
##  $ positiveIncrease        : int  65180 77233 70953 65382 62879 58465 60978 63007 66645 58836 ...
##  $ totalTestResultsIncrease: int  761771 851788 824588 762800 760282 722099 727318 635437 823375 698788 ...
##  $ hash                    : Factor w/ 179 levels "0063e380f01c09a74ff18ab1940b36390ef365d9",..: 118 30 96 170 111 66 61 116 131 5 ...

US: Available Variables

  1. date - Date on which data was compiled by The COVID Tracking Project.
  2. states - Number of states and territories included in the US dataset for this day.
  3. positive - Total number of people with confirmed OR probable COVID-19 reported by the state or territory (per the expanded CSTE case definition of April 5th, 2020 approved by the CDC). A confirmed case is a person who has a positive test result from an FDA approved diagnostic molecular test. A probable case is a person who either has presentable symptoms WITH epidemiological evidence or has BOTH a positive presumptive laboratory test AND also EITHER presentable symptoms OR epidemiological evidence. Epidemiological evidence refers either to close-proximity contact with a known case or travel history to an area with high disease incidence. According to the guidelines, FDA approved antibody and antigen tests are considered part of presumptive laboratory evidence.
  4. negative - Individuals with a completed viral test that returned a negative result. For states / territories that do not report this number directly, we compute it using one of several methods, depending on which data points the state provides.
  5. pending - Total number of viral tests that have not been completed as reported by the state or territory.
  6. hospitalizedCurrently - Individuals who are currently hospitalized with COVID-19. Definitions vary by state / territory. Where possible, we report hospitalizations with confirmed or probable COVID-19 cases per the expanded CSTE case definition of April 5th, 2020 approved by the CDC.
  7. hospitalizedCumulative - Total number of individuals who have ever been hospitalized with COVID-19. Definitions vary by state / territory. Where possible, we report hospitalizations with confirmed or probable COVID-19 cases per the expanded CSTE case definition of April 5th, 2020 approved by the CDC.
  8. inIcuCurrently - Individuals who are currently hospitalized in the Intensive Care Unit with COVID-19. Definitions vary by state / territory. Where possible, we report patients in the ICU with confirmed or probable COVID-19 cases per the expanded CSTE case definition of April 5th, 2020 approved by the CDC.
  9. inIcuCumulative - Total number of individuals who have ever been hospitalized in the Intensive Care Unit with COVID-19. Definitions vary by state / territory. Where possible, we report patients in the ICU with confirmed or probable COVID-19 cases per the expanded CSTE case definition of April 5th, 2020 approved by the CDC.
  10. onVentilatorCurrently - Individuals who are currently hospitalized under advanced ventilation with COVID-19. Definitions vary by state / territory. Where possible, we report patients on ventilation with confirmed or probable COVID-19 cases per the expanded CSTE case definition of April 5th, 2020 approved by the CDC.
  11. onVentilatorCumulative - Total number of individuals who have ever been hospitalized under advanced ventilation with COVID-19. Definitions vary by state / territory. Where possible, we report patients on ventilation with confirmed or probable COVID-19 cases per the expanded CSTE case definition of April 5th, 2020 approved by the CDC.
  12. recovered - Total number of people identified as recovered from COVID-19. States provide disparate definitions of what constitutes a “recovered” COVID-19 case. Types of “recovered” cases include those discharged from hospitals, those released from isolation after meeting CDC guidance on symptom cessation, and those not identified as fatalities after a number of days (30 or more) post disease onset. Specifics vary for each state or territory.
  13. dateChecked - Deprecated. This is an old label for lastUpdateEt.
  14. death - Total fatalities with confirmed OR probable COVID-19 case diagnosis (per the expanded CSTE case definition of April 5th, 2020 approved by the CDC). In states where the information is available, it only tracks fatalities with confirmed OR probable COVID-19 case diagnosis where COVID-19 is an underlying cause of death according to the death certificate based on WHO guidelines.
  15. hospitalized - Deprecated. Old label for hospitalizedCumulative.
  16. lastModified - Deprecated. Old label for lastUpdateEt.
  17. total - Deprecated. Computed by adding positive, negative, and pending values.
  18. totalTestResults - Currently computed by adding positive and negative values, both to work around reporting lags between positives and total tests and because some states do not report totals. Deprecated in the API and soon to be replaced on the website as well.
  19. posNeg - Deprecated. Computed by adding positive and negative values.
  20. deathIncrease - Increase in death computed by subtracting the value of death for the previous day from the value of death for the current day.
  21. hospitalizedIncrease - Increase in hospitalizedCumulative computed by subtracting the value of hospitalizedCumulative for the previous day from the value of hospitalizedCumulative for the current day.
  22. negativeIncrease - Increase in negative computed by subtracting the value of negative for the previous day from the value of negative for the current day.
  23. positiveIncrease - Increase in positive computed by subtracting the value of positive from the previous day from the value of positive for the current day.
  24. totalTestResultsIncrease - Deprecated. Increase in totalTestResults computed by subtracting the value of totalTestResults for the previous day from the value of totalTestResults for the current day.
  25. hash - A hash for this record.
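The *Increase fields (20-24) are day-over-day differences of their cumulative counterparts. A minimal sketch with hypothetical numbers shows the relationship:

```r
# Hypothetical cumulative death counts, ordered oldest to newest.
death_cumulative <- c(100, 122, 153, 199)

# Each day's increase is the current cumulative value minus the
# previous day's value, i.e. a lagged difference.
death_increase <- diff(death_cumulative)
death_increase  # 22 31 46
```

The same relationship underlies deathIncrease, hospitalizedIncrease, negativeIncrease, and positiveIncrease in the raw data.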

Variable Evaluation and Formatting

us <- transform(us, date = as.Date(as.character(date), "%Y%m%d"))
head(us)
##         date states positive negative pending hospitalizedCurrently
## 1 2020-07-18     56  3692061 41273443    3032                 57562
## 2 2020-07-17     56  3626881 40576852    3002                 57705
## 3 2020-07-16     56  3549648 39802297    2929                 57369
## 4 2020-07-15     56  3478695 39048662    2947                 56143
## 5 2020-07-14     56  3413313 38351244     960                 55533
## 6 2020-07-13     56  3350434 37653841    2610                 53988
##   hospitalizedCumulative inIcuCurrently inIcuCumulative
## 1                 276439           6396           12342
## 2                 274436           6453           12243
## 3                 271758           6359           12091
## 4                 269543           6317           12002
## 5                 267127           6235           11857
## 6                 264865           6074           11749
##   onVentilatorCurrently onVentilatorCumulative recovered
## 1                  2343                   1211   1122720
## 2                  2353                   1200   1106844
## 3                  2317                   1175   1090645
## 4                  2317                   1166   1075882
## 5                  2263                   1154   1049098
## 6                  2259                   1142   1031939
##            dateChecked  death hospitalized         lastModified    total
## 1 2020-07-18T00:00:00Z 132395       276439 2020-07-18T00:00:00Z 44968536
## 2 2020-07-17T00:00:00Z 131523       274436 2020-07-17T00:00:00Z 44206735
## 3 2020-07-16T00:00:00Z 130572       271758 2020-07-16T00:00:00Z 43354874
## 4 2020-07-15T00:00:00Z 129598       269543 2020-07-15T00:00:00Z 42530304
## 5 2020-07-14T00:00:00Z 128740       267127 2020-07-14T00:00:00Z 41765517
## 6 2020-07-13T00:00:00Z 128004       264865 2020-07-13T00:00:00Z 41006885
##   totalTestResults   posNeg deathIncrease hospitalizedIncrease
## 1         44965504 44965504           872                 2003
## 2         44203733 44203733           951                 2678
## 3         43351945 43351945           974                 2215
## 4         42527357 42527357           858                 2416
## 5         41764557 41764557           736                 2262
## 6         41004275 41004275           327                 1214
##   negativeIncrease positiveIncrease totalTestResultsIncrease
## 1           696591            65180                   761771
## 2           774555            77233                   851788
## 3           753635            70953                   824588
## 4           697418            65382                   762800
## 5           697403            62879                   760282
## 6           663634            58465                   722099
##                                       hash
## 1 a6838d287f4a1270b49fde1e76e01bab4e2d7b2e
## 2 2fd5fa9462df840f6808d6757e4d13f702be8fdb
## 3 8644aacf2b4997ff08daf87ae088c7f6a44cb367
## 4 ef1be31b81541c1ff634f7adff6a5cebe14bab3e
## 5 9fff665c0a4f59af0ca8746fa5d90e91b35c0967
## 6 5b8ab51eb57a48502d534a29f1e5d62791ac8a24

Logical Variable Removal

During this project we will be trying to surface the metrics that properly evaluate the severity of COVID in the US. In order to simplify our data set, we will remove all variables that do not directly link to COVID metrics or that depict the same information as another variable (e.g., totalTestResults and posNeg).

us <- subset(us, select = -c(states, dateChecked, hospitalized, lastModified, total, posNeg, totalTestResultsIncrease, hash))
head(us)
##         date positive negative pending hospitalizedCurrently
## 1 2020-07-18  3692061 41273443    3032                 57562
## 2 2020-07-17  3626881 40576852    3002                 57705
## 3 2020-07-16  3549648 39802297    2929                 57369
## 4 2020-07-15  3478695 39048662    2947                 56143
## 5 2020-07-14  3413313 38351244     960                 55533
## 6 2020-07-13  3350434 37653841    2610                 53988
##   hospitalizedCumulative inIcuCurrently inIcuCumulative
## 1                 276439           6396           12342
## 2                 274436           6453           12243
## 3                 271758           6359           12091
## 4                 269543           6317           12002
## 5                 267127           6235           11857
## 6                 264865           6074           11749
##   onVentilatorCurrently onVentilatorCumulative recovered  death
## 1                  2343                   1211   1122720 132395
## 2                  2353                   1200   1106844 131523
## 3                  2317                   1175   1090645 130572
## 4                  2317                   1166   1075882 129598
## 5                  2263                   1154   1049098 128740
## 6                  2259                   1142   1031939 128004
##   totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 1         44965504           872                 2003           696591
## 2         44203733           951                 2678           774555
## 3         43351945           974                 2215           753635
## 4         42527357           858                 2416           697418
## 5         41764557           736                 2262           697403
## 6         41004275           327                 1214           663634
##   positiveIncrease
## 1            65180
## 2            77233
## 3            70953
## 4            65382
## 5            62879
## 6            58465

NA Evaluation

Upon evaluating the columns that contain NA values, we can see that in the early dates of data collection the US had a low count of COVID cases. Since those metrics were likely actually zero (they were not yet being collected at that time), we will replace all NAs with zeros.

us[is.na(us)] <- 0
tail(us)
##           date positive negative pending hospitalizedCurrently
## 174 2020-01-27        2        0       0                     0
## 175 2020-01-26        2        0       0                     0
## 176 2020-01-25        2        0       0                     0
## 177 2020-01-24        2        0       0                     0
## 178 2020-01-23        2        0       0                     0
## 179 2020-01-22        2        0       0                     0
##     hospitalizedCumulative inIcuCurrently inIcuCumulative
## 174                      0              0               0
## 175                      0              0               0
## 176                      0              0               0
## 177                      0              0               0
## 178                      0              0               0
## 179                      0              0               0
##     onVentilatorCurrently onVentilatorCumulative recovered death
## 174                     0                      0         0     0
## 175                     0                      0         0     0
## 176                     0                      0         0     0
## 177                     0                      0         0     0
## 178                     0                      0         0     0
## 179                     0                      0         0     0
##     totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 174                2             0                    0                0
## 175                2             0                    0                0
## 176                2             0                    0                0
## 177                2             0                    0                0
## 178                2             0                    0                0
## 179                2             0                    0                0
##     positiveIncrease
## 174                0
## 175                0
## 176                0
## 177                0
## 178                0
## 179                0

Zero variance variable check

All variables show nonzero variance, so we will keep all of them in the model for now.

skim(us)
## Skim summary statistics
##  n obs: 179 
##  n variables: 17 
## 
## ── Variable type:Date ──────────────────────────────────────────────────────────
##  variable missing complete   n        min        max     median n_unique
##      date       0      179 179 2020-01-22 2020-07-18 2020-04-20      179
## 
## ── Variable type:integer ───────────────────────────────────────────────────────
##              variable missing complete   n          mean            sd
##         deathIncrease       0      179 179     739.64        736.87   
##  hospitalizedIncrease       0      179 179    1544.35       1972.24   
##              negative       0      179 179 9542260.32          1.2e+07
##      negativeIncrease       0      179 179  230577.89     228080.46   
##              positive       0      179 179   1e+06       1080318.9    
##      positiveIncrease       0      179 179   20626.03      18316.97   
##      totalTestResults       0      179 179       1.1e+07       1.3e+07
##     p0    p25     p50           p75          p100     hist
##      0    3.5     655    1216          2740       ▇▂▃▂▁▂▁▁
##  -2841    0      1245    2258.5       17320       ▁▇▂▁▁▁▁▁
##      0 2578   3269209       1.7e+07       4.1e+07 ▇▂▁▁▁▁▁▁
##     -3  626    136639  414259.5      774555       ▇▂▂▂▂▁▂▁
##      2 1161    786933 1866669.5     3692061       ▇▂▂▂▂▁▁▁
##      0  203.5   21590   28924.5       77233       ▇▂▇▂▁▁▁▁
##      2 3739   4056142       1.9e+07       4.5e+07 ▇▂▁▁▁▁▁▁
## 
## ── Variable type:numeric ───────────────────────────────────────────────────────
##                variable missing complete   n      mean        sd p0   p25
##                   death       0      179 179  50519.59  49837.79  0  26.5
##  hospitalizedCumulative       0      179 179 107636.91  1e+05     0   6  
##   hospitalizedCurrently       0      179 179  26293.04  21887.94  0   0  
##         inIcuCumulative       0      179 179   4237.38   4434.51  0   0  
##          inIcuCurrently       0      179 179   5476.62   5033.25  0   0  
##  onVentilatorCumulative       0      179 179    384.97    405.83  0   0  
##   onVentilatorCurrently       0      179 179   2346.8    2259.85  0   0  
##                 pending       0      179 179   6483.64  14080.45  0 330  
##               recovered       0      179 179 271316.65 343207.89  0   0  
##    p50      p75    p100     hist
##  39028  1e+05    132395 ▇▁▁▁▁▂▂▂
##  97369 215272    276439 ▇▁▁▁▂▁▃▂
##  30890  44527     59538 ▇▁▁▂▃▃▂▃
##   2193   8737.5   12342 ▇▂▁▁▂▂▂▂
##   5597   9069     15130 ▇▁▃▃▂▂▂▂
##    214    720      1211 ▇▂▁▁▂▂▁▂
##   2172   4501.5    7070 ▇▁▃▁▂▂▂▁
##   2186   3576     65709 ▇▁▁▁▁▁▁▁
##  69493 563494.5 1122720 ▇▂▁▁▁▁▁▁
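The visual check above can also be done programmatically. A sketch on a toy data frame (hypothetical values; the same idea applies to the us data frame) flags any constant column:

```r
# Toy frame with one constant (zero-variance) column.
df <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5))

# Variance of each column; a value of zero marks a constant column.
vars <- sapply(df, var)
names(vars)[vars == 0]  # "b"
```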

COVID Case Cumulative Variable Review

Below we will use a basic line graph by day to see the general trend of some of the basic cumulative variables provided by our raw data set. The cumulative variables are interesting for seeing the general trend of the data. However, they would not be beneficial for a forecasting initiative, since they will simply continue to trend upward as the virus spreads. They give no real insight that could be used in COVID relief efforts for hospitals.

Cumulative COVID Cases in US

We can see that total cumulative COVID cases are increasing over time with what appears to be no reprieve.

library(ggplot2)
library(ggthemes)

ggplot(data = us, aes(x=date, y=positive))+
  geom_line(color="black")+
  labs(title = "Cumulative COVID Cases US", y = "Millions", x = "") +
  theme_fivethirtyeight()

Cumulative COVID Hospitalized Cases in US

Total hospitalizations due to COVID also appear to increase over time. We do see a little more movement in this graph; however, it is overall clearly increasing.

ggplot(data = us, aes(x=date, y=hospitalizedCumulative))+
  geom_line(color="orange")+
  labs(title = "Cumulative COVID Hospitalized Cases US", y = "Millions", x = "") +
  theme_fivethirtyeight()

New and Current COVID Case Variable Review

In an attempt to take a more micro view of the COVID data, we review some of the new and current variables below. We are looking to surface the variables that will be most beneficial for forecasting the severity of cases, and therefore hospital resources, moving forward.

New Confirmed COVID in US

When we start to investigate at a more micro level and review something like daily new cases, we see more trending behavior and something that can be more properly evaluated. Newly confirmed cases have more information to tell us from a trend perspective: we can see some pseudo-cyclic behavior alongside heavy wandering behavior.

ggplot(data = us, aes(x=date, y=positiveIncrease))+
  geom_line(color="black")+
  labs(title = "New COVID Cases US", y = "Thousands", x = "") +
  theme_fivethirtyeight()

Currently COVID Hospitalizations in US

When measuring the severity of cases, we wanted to make sure that we included the most severe cases; we believe a good indicator of this would be a hospitalization metric. This is not only because severe cases end up hospitalized, but also because, given the initial fear over resources and space, it would be a key forecast for medical communities preparing for a potential next round of patients.

We have seen research suggesting ‘that most people who contract the new coronavirus develop mild cases of Covid-19’, from an article released in June that can be found here. Given the likelihood of asymptomatic patients, hospitalization metrics reveal themselves to be a more accurate representation of severe cases.

Originally we were going to evaluate new COVID hospitalization cases. However, upon closer review it became clear that large states that were hit the hardest were failing to report this metric. We were able to confirm this when we compared the New versus Current metrics: we see a large spike in hospitalization cases overall, but the same results are not reflected in the New COVID cases metric. Therefore, we err on the side of caution and move forward with modeling the Currently hospitalized metric.

Looking at a more micro view of currently hospitalized COVID cases shows us much more behavior that we might be able to evaluate moving forward. Below we can see what appears to be heavy wandering behavior and, if we look closely, some additional noise that could be pseudo-cyclic behavior hidden by the large numbers.

ggplot(data = us, aes(x=date, y=hospitalizedIncrease))+
  geom_line(color="orange")+
  labs(title = "New COVID Hospitalized Cases US", y = "Thousands", x = "") +
  theme_fivethirtyeight()

ggplot(data = us, aes(x=date, y=hospitalizedCurrently))+
  geom_line(color="orange")+
  labs(title = "Current COVID Hospitalized Cases US", y = "Thousands", x = "") +
  theme_fivethirtyeight()

New COVID Deaths in US

When we review the daily new death count, we again see what appears to be pseudo-cyclic behavior with perhaps some slight wandering tendencies.

ggplot(data = us, aes(x=date, y=deathIncrease))+
  geom_line(color="dark red")+
  labs(title = "New COVID Death Cases US", y = "Thousands", x = "") +
  theme_fivethirtyeight()

Calculated Variables Creation & Review

In order to review the data at an even smaller level, we have created a positive percent metric as well as hospitalization, recovery, and death rates among positive cases.

Positive Percent Rate

When reviewing the daily positivity rate, the metric that has been more widely reported on lately, we see the positivity rate actually decreasing over time. We would expect this outcome, since we also saw a drastic increase in testing over time.

While this metric is better than raw new cases for forecasting severity (it normalizes to the number of tests), it doesn’t tell us anything about impact. We tend toward the currently hospitalized metric because it allows us to take into consideration things such as medical resources and why we, as a nation, were attempting to flatten the curve.

us$posRate <- us$positive / us$totalTestResults

ggplot(data = us, aes(x=date, y=posRate))+
  geom_line(color="black")+
  labs(title = "Daily Positivity Rate COVID Testing US", y = "Rate", x = "") +
  theme_fivethirtyeight()

Hospitalization Rate

When reviewing rates, we also chose to observe the trend of the hospitalization rate of positively tested COVID patients. We see the hospitalization rate start very low, then spike to around 15%. The hospitalization rate seems to hover in the low teens for a few months and then appears to begin to decrease.

While reviewing this data we determined that this was not the best metric for forecasting the severity of cases. The reason is that individuals hospitalized for reasons other than COVID who then returned a positive test would be listed as COVID patients, even if the virus itself was not affecting them (asymptomatic). Therefore we tend toward currently hospitalized as our severity metric, since it takes into account all hospitalization resources and doesn’t focus on a metric that appears to be misleading at this time.

us$hospRate <- us$hospitalizedCumulative / us$positive

ggplot(data = us, aes(x=date, y=hospRate))+
  geom_line(color="dark orange")+
  labs(title = "Daily Hospitalization Rate COVID Testing US", y = "Rate", x = "") +
  theme_fivethirtyeight()

Recovery Rate

In a pattern opposite to the previous rates we have evaluated, we actually see a sharp increase in recovery rate over time. It appears to be plateauing recently, but we would expect it to continue to increase and settle at a final percentage of infected individuals who have recovered from COVID.

At first we thought the recovery rate would be a great opposite indicator to compare against a severity indicator. However, upon closer investigation we quickly realized that only a few states were collecting recovered data at all. Without actual numbers from all states, this is not a good reflection of the overall COVID recovery rate, and it would not be good data to forecast.

us$recRate <- us$recovered / us$positive

ggplot(data = us, aes(x=date, y=recRate))+
  geom_line(color="dark green")+
  labs(title = "Daily Recovery Rate COVID Testing US", y = "Rate", x = "") +
  theme_fivethirtyeight()

Death Rate

Like the previous rates we have examined, we see an initial spike in the death rate followed by what appears to be an equally quick drop-off. With the death rate hovering just under 4%, we should also note that the flu death rate is about 1%; that would put COVID at roughly four times as deadly as the flu.

Since death is the most severe outcome of the COVID virus, we have chosen to model this metric as a measure of severity alongside currently hospitalized. With these two metrics we feel we will have a better picture of severity and be able to give our nation a better idea of what to expect.

us$deathRate <- us$death / us$positive

ggplot(data = us, aes(x=date, y=deathRate))+
  geom_line(color="dark red")+
  labs(title = "Daily Death Rate COVID Testing US", y = "Rate", x = "") +
  theme_fivethirtyeight()

Matrix Scatterplot of Surfaced Variables

Trying to create pairs plots for all of the variables against each other would be very difficult to interpret. As suspected, there are highly correlated variables, since we are graphing things like total positive tests and new positive tests; these variables are inherently correlated because they communicate similar data. Since this is the case, we will only review a scatterplot of the variables we surfaced above and will use for forecasting in the multivariate testing.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
scatter <- subset(us, select = c(positive, negative, hospitalizedCurrently, hospitalizedCumulative, death, totalTestResults, deathIncrease, hospitalizedIncrease, negativeIncrease, positiveIncrease, posRate, deathRate))
ggpairs(scatter)
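The strong correlations among cumulative and daily variables are expected by construction. A small sketch with hypothetical daily counts illustrates why a trending cumulative series correlates highly with the daily increases that build it:

```r
# Hypothetical daily new cases, trending upward.
daily <- c(5, 7, 9, 12, 15, 20)
cumulative <- cumsum(daily)

# Both series rise together, so their correlation is close to 1.
cor(cumulative, daily)
```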

Goal 2: US COVID Data

Data Prep

In order to prep our data from our EDA, we load the data, transform the date, remove all zero rows for the variable we are testing (hospitalizedCurrently), and sort from oldest to newest. This will allow us to easily work through our univariate time series evaluations.

us <- read.csv("https://raw.githubusercontent.com/JaclynCoate/6373_Time_Series/master/TermProject/Data/USdaily.7.26.20.csv", header = T, strip.white = T)
us <- transform(us, date = as.Date(as.character(date), "%Y%m%d"))
us <- subset(us, select = -c(states, dateChecked, hospitalized, lastModified, total, posNeg, totalTestResultsIncrease, hash))
us[is.na(us)] <- 0
# Selecting only those dates with reported current hospitalizations
us <- dplyr::slice(us, 1:132)
us <- us[order(us$date), ]
head(us)
##           date positive negative pending hospitalizedCurrently
## 132 2020-03-17    11928    63104    1687                   325
## 131 2020-03-18    15099    84997    2526                   416
## 130 2020-03-19    19770   108407    3016                   617
## 129 2020-03-20    26025   138814    3330                  1042
## 128 2020-03-21    32910   177262    3468                  1436
## 127 2020-03-22    42169   213476    2842                  2155
##     hospitalizedCumulative inIcuCurrently inIcuCumulative
## 132                     55              0               0
## 131                     67              0               0
## 130                     85              0               0
## 129                    108              0               0
## 128                   2020              0               0
## 127                   3023              0               0
##     onVentilatorCurrently onVentilatorCumulative recovered death
## 132                     0                      0         0   122
## 131                     0                      0         0   153
## 130                     0                      0         0   199
## 129                     0                      0         0   267
## 128                     0                      0         0   328
## 127                     0                      0         0   471
##     totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 132            75032            22                   13            13707
## 131           100096            31                   12            21893
## 130           128177            46                   18            23410
## 129           164839            68                   23            30407
## 128           210172            61                 1912            38448
## 127           255645           143                 1003            36214
##     positiveIncrease
## 132             3613
## 131             3171
## 130             4671
## 129             6255
## 128             6885
## 127             9259

Stationarity: Current Hospitalizations

It is difficult to assume stationarity for this data for several reasons. We are working under the assumption that COVID is a novel virus, so cases and hospitalizations should eventually return to zero; our current modeling techniques, by contrast, tend to revert to the mean or mimic previously seen trends. We also see a heavy wandering trend in new cases, and hospitalizations depend on new cases as well as on time. We will review the data, see what non-stationary components, if any, reveal themselves, and model the data accordingly.

Univariate AR/ARMA Modeling

  1. Original Realization Analysis Traits:
  • Heavy wandering behavior
  • What appears to be some noise that could be pseudo-cyclic behavior masked by the large scale of the series.
ggplot(data = us, aes(x=date, y=hospitalizedCurrently))+
  geom_line(color="orange")+
  labs(title = "Current COVID Hospitalized Cases US", y = "Thousands", x = "") +
    theme_fivethirtyeight()

  1. Sample Realization, ACF, and Spectral Density:
  • Heavy wandering behavior
  • Possible small pseudo-cyclic behavior

ACF:

  • Very slowly damping behavior that is consistent with a d = 1 ARIMA model.

Spectral Density:

  • Peak at f=0
  • What appears to be a wave through the rest of the graph; this could be hidden seasonality or another frequency peak masked by the pseudo-cyclic behavior noted in the realization above.
plotts.sample.wge(us$hospitalizedCurrently, lag.max = 100)

## $autplt
##   [1]  1.000000000  0.967247046  0.928472445  0.883791745  0.834039153
##   [6]  0.779901780  0.722099406  0.660670233  0.595507232  0.527652159
##  [11]  0.459798098  0.392707355  0.325070497  0.257394633  0.190076427
##  [16]  0.123107742  0.058045025 -0.005592108 -0.062510518 -0.114255855
##  [21] -0.163281121 -0.206979530 -0.242176807 -0.275097030 -0.300626756
##  [26] -0.323018458 -0.340862401 -0.357234304 -0.371125261 -0.380223777
##  [31] -0.387733922 -0.393904174 -0.399188337 -0.403702529 -0.407535556
##  [36] -0.409058093 -0.406032310 -0.402291057 -0.397307333 -0.392192952
##  [41] -0.385158345 -0.377444480 -0.367999137 -0.357234818 -0.345390268
##  [46] -0.333657101 -0.320405088 -0.307093151 -0.293895002 -0.279566463
##  [51] -0.263105425 -0.246363949 -0.229539058 -0.213137515 -0.196728805
##  [56] -0.180700291 -0.164151991 -0.145439160 -0.126569221 -0.107984741
##  [61] -0.090254314 -0.072242265 -0.054130291 -0.034756123 -0.014095709
##  [66]  0.007187366  0.028500485  0.048915792  0.068729196  0.088785284
##  [71]  0.109636722  0.130889993  0.152506791  0.173725394  0.193391213
##  [76]  0.211419137  0.228441065  0.245143327  0.260841689  0.274513731
##  [81]  0.286671443  0.296525403  0.304596844  0.311148685  0.317183678
##  [86]  0.321808447  0.324325398  0.323899788  0.321013182  0.315477248
##  [91]  0.307881329  0.298203689  0.286925511  0.272658784  0.256479904
##  [96]  0.236600281  0.213678525  0.188829126  0.163809020  0.138368627
## [101]  0.111541632
## 
## $freq
##  [1] 0.007575758 0.015151515 0.022727273 0.030303030 0.037878788
##  [6] 0.045454545 0.053030303 0.060606061 0.068181818 0.075757576
## [11] 0.083333333 0.090909091 0.098484848 0.106060606 0.113636364
## [16] 0.121212121 0.128787879 0.136363636 0.143939394 0.151515152
## [21] 0.159090909 0.166666667 0.174242424 0.181818182 0.189393939
## [26] 0.196969697 0.204545455 0.212121212 0.219696970 0.227272727
## [31] 0.234848485 0.242424242 0.250000000 0.257575758 0.265151515
## [36] 0.272727273 0.280303030 0.287878788 0.295454545 0.303030303
## [41] 0.310606061 0.318181818 0.325757576 0.333333333 0.340909091
## [46] 0.348484848 0.356060606 0.363636364 0.371212121 0.378787879
## [51] 0.386363636 0.393939394 0.401515152 0.409090909 0.416666667
## [56] 0.424242424 0.431818182 0.439393939 0.446969697 0.454545455
## [61] 0.462121212 0.469696970 0.477272727 0.484848485 0.492424242
## [66] 0.500000000
## 
## $db
##  [1]   8.4297671  13.7645189  12.5249640   8.3502117   4.2830812
##  [6]  -0.7743908  -1.6306179  -1.6820655  -1.1397181  -1.9660002
## [11]  -3.8470478  -4.5601791  -5.7979845  -6.6448260  -8.1613031
## [16]  -7.7172776  -7.0289454  -6.7231288 -11.3437841  -7.3292891
## [21]  -8.5570448  -9.8123688 -12.3352898 -12.0470153 -10.4188757
## [26] -11.1989248 -11.5484834 -11.3001319 -11.8583444 -13.4758586
## [31] -13.5752424 -13.1601144 -13.4822098 -11.8751077 -12.0098408
## [36] -11.6652933 -14.1857608 -18.7098443 -15.1452524 -14.2793937
## [41] -13.6312562 -13.4537203 -13.8449147 -14.8421169 -15.8144787
## [46] -16.3497508 -16.1392884 -14.5229548 -13.9096257 -14.2979673
## [51] -15.1762427 -16.9533791 -16.6240685 -15.5004175 -14.8930293
## [56] -15.3404690 -17.3904939 -15.6412620 -14.5937114 -14.5227461
## [61] -16.6299676 -17.9885318 -16.8369446 -17.2106857 -15.1218462
## [66] -13.7373278
## 
## $dbz
##  [1]  10.8000075  10.4225032   9.7877656   8.8879309   7.7133502
##  [6]   6.2551734   4.5113555   2.5000875   0.2870472  -1.9727273
## [11]  -4.0185421  -5.5981361  -6.6776527  -7.4337186  -8.0352652
## [16]  -8.5435576  -8.9672493  -9.3368790  -9.7184801 -10.1759984
## [21] -10.7327082 -11.3585971 -11.9855900 -12.5448786 -13.0053273
## [26] -13.3805586 -13.7009145 -13.9838700 -14.2295004 -14.4361103
## [31] -14.6143036 -14.7849900 -14.9660787 -15.1624073 -15.3669935
## [36] -15.5703444 -15.7685492 -15.9634820 -16.1567949 -16.3451059
## [41] -16.5214784 -16.6812308 -16.8255589 -16.9589464 -17.0832210
## [46] -17.1948352 -17.2887237 -17.3653174 -17.4335644 -17.5062347
## [51] -17.5910578 -17.6848979 -17.7756255 -17.8505813 -17.9054045
## [56] -17.9463455 -17.9845791 -18.0276392 -18.0745244 -18.1172420
## [61] -18.1468027 -18.1590805 -18.1567596 -18.1470189 -18.1375821
## [66] -18.1338177
  1. Overfit tables
  • Since we are seeing heavy wandering behavior, we use overfit tables to see whether we can surface any (1-B) factors with roots very near the unit circle.

    • Below we can clearly see one (1-B) factor with a root nearly on the unit circle.
est.ar.wge(us$hospitalizedCurrently,p=6,type='burg')
## 
## Coefficients of Original polynomial:  
## 1.2113 0.0324 -0.0917 -0.0697 0.0013 -0.1010 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.9283B+0.9354B^2    1.0308+-0.0812i      0.9672       0.0125
## 1+0.9678B+0.3408B^2   -1.4200+-0.9583i      0.5838       0.4055
## 1-0.2507B+0.3168B^2    0.3957+-1.7322i      0.5628       0.2143
##   
## 
## $phi
## [1]  1.211269271  0.032440618 -0.091744816 -0.069675720  0.001316006
## [6] -0.100966743
## 
## $res
##   [1]  -292.92920    42.23585   -92.70196  -102.02123  -334.70493
##   [6]  -145.22402  -403.96302    34.46467   -83.49135  1318.62578
##  [11]  1377.16436  -880.96039  -636.85824  -375.11269   230.53004
##  [16]   589.34824  -162.87147   523.10092  2268.40156  -856.96328
##  [21]  1307.41009  4905.29484 -2079.33995  2125.99214 -1561.57046
##  [26]  -286.55877 -2910.62575  -721.73533  2310.48679  -741.05699
##  [31] -1458.33820  -885.25596 -1020.67917  -806.18397  1240.82698
##  [36]  3855.28027  -594.27955  -332.98691 -1561.70454   599.16915
##  [41]  -668.18241   910.28844   838.47931   588.30675  -699.57648
##  [46]   648.28101  -290.27821  -792.40531   664.36263  1721.18828
##  [51]  -247.47143  -655.56939 -1027.72246  -316.65232  -555.31836
##  [56]   670.18033  2208.79087   -46.31492  -741.64210  -711.16638
##  [61]  -345.85017  -802.33939  1055.50075  1019.45748   423.37585
##  [66]  -263.76408  -870.20100  -934.79262  -213.25332   762.11152
##  [71]   755.17733   958.19565  -257.19616 -1107.16022 -1097.48478
##  [76]  -392.31146    25.12350   266.29333  -150.37710    11.46933
##  [81]   -24.36326  -230.98962  -270.72490    81.75412   189.17778
##  [86]  -227.80796 -1100.72037  -289.22665  -380.30212  -216.89601
##  [91]   321.96954   641.36609   239.90315  -394.37919   -45.80287
##  [96]  -932.43620   101.31456   444.74934  1155.39447   225.84784
## [101]   -36.98998  -963.47841    90.12262  -630.67985   649.45247
## [106]  1114.69662   348.88102    78.19229  -682.63199  -511.33045
## [111]   -33.55819   480.50704  1404.41245   468.10554  -172.16756
## [116]  6696.26244 -2026.25073 -1457.43811  -294.49345   416.89154
## [121]  -780.81322   714.03055  -299.92450  -595.44881    59.38061
## [126]   536.60736   999.48047   240.91788   133.31315  -233.45378
## [131]  -301.48621  -282.07861
## 
## $avar
## [1] 1358148
## 
## $aic
## [1] 14.22769
## 
## $aicc
## [1] 15.25171
## 
## $bic
## [1] 14.38057
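The "Abs Recip" column in the factor table above is 1/|root| for each root of the AR characteristic polynomial; a value near 1 signals a near-unit root and motivates the (1-B) difference. A minimal sketch of that check, reusing the printed Burg estimates (Python and numpy are used here purely for illustration):

```python
import numpy as np

# AR(6) coefficients as printed by est.ar.wge above
phi = [1.211269271, 0.032440618, -0.091744816, -0.069675720, 0.001316006, -0.100966743]

# Characteristic polynomial phi(B) = 1 - phi1*B - ... - phi6*B^6,
# written with the highest power first for numpy.roots
coefs = [-p for p in reversed(phi)] + [1.0]
roots = np.roots(coefs)

# "Abs Recip" = 1 / |root|; a value near 1 indicates a root near the
# unit circle, i.e. evidence for a (1-B) factor
abs_recip = sorted(1.0 / np.abs(roots), reverse=True)
print(abs_recip[0])  # ~0.967, matching the dominant factor in the table above
```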
  1. Difference the data based on the surfaced (1-B) factor
  • Once the data has been differenced, it looks much closer to a stationary data set. However, we have also surfaced what appears to be a small seasonal component: the ACF shows higher spikes at lags 7 and 14, which leads us to believe there is a 7-day seasonal component.
us.diff = artrans.wge(us$hospitalizedCurrently, phi.tr = 1, lag.max = 100)

acf(us.diff)

  1. Seasonality Transformation
  • Above we surfaced what appears to be a 7-day seasonal component. We will now transform the data with s = 7.
us.diff.seas = artrans.wge(us.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)
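The two transforms above are a (1-B) difference (artrans.wge with phi.tr = 1) followed by a (1-B^7) seasonal difference (phi.tr = c(0,0,0,0,0,0,1)). A language-agnostic sketch of both, written in Python for illustration:

```python
import numpy as np

def artrans(x, lags):
    """Apply (1 - B^lag) differences sequentially, mirroring the two
    artrans.wge calls above (lag 1, then lag 7)."""
    x = np.asarray(x, dtype=float)
    for lag in lags:
        x = x[lag:] - x[:-lag]
    return x

x = np.arange(20, dtype=float) ** 2   # toy series (not the COVID data)
y = artrans(x, lags=[1, 7])           # first difference, then lag-7 seasonal difference
print(len(y))  # 20 - 1 - 7 = 12 observations remain
```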

  1. Diagnose Model w/ aic.wge
  • When we diagnose the best models for our stationary data set, the AIC5 function in R selects an ARMA(5,1) model by AIC, while BIC selects an AR(2). The AR(2) model is consistent with our pseudo-cyclic data as well as the damping cyclical sample autocorrelations produced by the transformed data. The ARMA(5,1) could also produce these same traits, so we will move forward and compare these two models.
aic5.wge(us.diff.seas)
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 17    5    1   14.57449
## 10    3    0   14.60386
## 13    4    0   14.61477
## 6     1    2   14.61527
## 8     2    1   14.61581
aic5.wge(us.diff.seas,type = "bic")
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  bic
##       p    q        bic
## 7     2    0   14.68775
## 10    3    0   14.69484
## 6     1    2   14.70625
## 8     2    1   14.70679
## 3     0    2   14.72173
  1. Diagnose white noise
  • Both Ljung-Box tests reject the null hypothesis of white noise with p-values below the 0.05 significance level, confirming there is still serial structure to model.
ljung.wge(us.diff.seas)$pval
## Obs 0.2522573 0.4096169 0.2803795 0.1452644 0.1021298 0.1354335 -0.2827722 0.07264095 -0.09804601 -0.01062144 0.004067316 -0.04130786 -0.03854601 -0.02866098 -0.09193388 -0.05167819 -0.1056372 -0.1193326 -0.08827532 -0.1061183 -0.1054081 -0.036613 -0.0654078 -0.03489691
## [1] 1.859792e-06
ljung.wge(us.diff.seas, K=48)$pval
## Obs 0.2522573 0.4096169 0.2803795 0.1452644 0.1021298 0.1354335 -0.2827722 0.07264095 -0.09804601 -0.01062144 0.004067316 -0.04130786 -0.03854601 -0.02866098 -0.09193388 -0.05167819 -0.1056372 -0.1193326 -0.08827532 -0.1061183 -0.1054081 -0.036613 -0.0654078 -0.03489691 0.02370281 -0.02743256 -0.02153882 -0.03584166 -0.1039903 -0.02877203 -0.03843455 -0.07298731 -0.03237146 -0.02495354 0.008256594 0.02396111 -0.06576515 -0.04980251 -0.04736059 -0.04415211 -0.03591561 -0.04413551 -0.03016308 0.03982533 0.008384104 0.02503231 0.03614677 0.01834679
## [1] 0.003990771
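ljung.wge computes the Ljung-Box statistic Q = n(n+2) Σ_{k=1..K} r_k² / (n-k), which under the white-noise null follows a chi-square distribution with K degrees of freedom (K = 24 by default, 48 in the second call above). A minimal sketch of the statistic itself, with the chi-square p-value lookup omitted:

```python
import numpy as np

def ljung_box_Q(res, K):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..K} r_k^2 / (n-k),
    where r_k is the lag-k sample autocorrelation of the series."""
    res = np.asarray(res, dtype=float)
    n = len(res)
    res = res - res.mean()
    denom = np.sum(res**2)
    total = 0.0
    for k in range(1, K + 1):
        r_k = np.sum(res[:-k] * res[k:]) / denom
        total += r_k**2 / (n - k)
    return n * (n + 2) * total

# toy alternating series: strong lag-1/lag-2 correlation inflates Q
print(ljung_box_Q([1, -1] * 4, K=2))  # 16.25
```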
  1. Estimate Phis and Thetas
  • AIC Phi and Theta Estimates
est.us.diff.seasAIC = est.arma.wge(us.diff.seas, p = 5, q=1)
## 
## Coefficients of Original polynomial:  
## -0.6097 0.4850 0.5047 0.0643 -0.2269 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1+0.9305B             -1.0747               0.9305       0.5000
## 1+0.9405B+0.5775B^2   -0.8143+-1.0337i      0.7599       0.3562
## 1-1.2613B+0.4223B^2    1.4934+-0.3713i      0.6498       0.0388
##   
## 
mean(us$hospitalizedCurrently)
## [1] 39625.17
  • BIC Phi Estimates
est.us.diff.seasBIC = est.arma.wge(us.diff.seas, p = 2)
## 
## Coefficients of Original polynomial:  
## 0.1594 0.3740 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.6964B              1.4359               0.6964       0.0000
## 1+0.5370B             -1.8622               0.5370       0.5000
##   
## 
mean(us$hospitalizedCurrently)
## [1] 39625.17

Univariate ARIMA(5,1,1), s=7 Forecasting

  • 7-Day Forecast
shortARMA <- fore.aruma.wge(us$hospitalizedCurrently, phi = est.us.diff.seasAIC$phi, theta = est.us.diff.seasAIC$theta, d= 1, s = 7, n.ahead = 7, lastn = F, limits = T)

  • AIC
est.us.diff.seasAIC$aic
## [1] 14.57449
  • Windowed ASE: 14,880,281
phis = est.us.diff.seasAIC$phi
thetas = est.us.diff.seasAIC$theta

trainingSize = 24
horizon = 7
ASEHolder = numeric()

for( i in 1:(124-(trainingSize + horizon) + 1))
{
  forecasts = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize-1))],phi = phis, theta = thetas, s = 7, d = 1, n.ahead = horizon)
  
  ASE = mean((us$hospitalizedCurrently[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
         
  ASEHolder[i] = ASE
}
invisible(ASEHolder)
hist(ASEHolder)

WindowedASE = mean(ASEHolder)

summary(ASEHolder)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     39401    754593   2108050  14880281   6998084 176230307
WindowedASE
## [1] 14880281
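The rolling-window ASE loop above generalizes to any forecaster: slide a training window across the series, forecast `horizon` steps ahead, average the squared errors within each window, then average across windows. A sketch in Python, with a naive last-value forecaster standing in for fore.aruma.wge:

```python
import numpy as np

def windowed_ase(x, forecast_fn, training_size, horizon):
    """Slide a training window, forecast `horizon` steps, average squared
    errors per window, then average the window ASEs (the loop above,
    generalized). `forecast_fn` is a stand-in for fore.aruma.wge."""
    x = np.asarray(x, dtype=float)
    ases = []
    for i in range(len(x) - training_size - horizon + 1):
        train = x[i : i + training_size]
        f = forecast_fn(train, horizon)
        actual = x[i + training_size : i + training_size + horizon]
        ases.append(np.mean((actual - f) ** 2))
    return np.mean(ases), ases

# naive "repeat the last observed value" forecaster on a toy linear ramp
naive = lambda train, h: np.repeat(train[-1], h)
mean_ase, ases = windowed_ase(np.arange(40, dtype=float), naive,
                              training_size=24, horizon=7)
print(mean_ase)  # every window has errors 1..7, so ASE = 140/7 = 20.0
```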
  • ASE: 73,605,156
fs = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize+horizon)-1)],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = 7, lastn = TRUE)

ASE = mean((us$hospitalizedCurrently[(i+trainingSize):(i+(trainingSize+horizon)-1)] - fs$f )^2)
ASE
## [1] 73605156
  • 90-Day Forecast
longARMA <- fore.aruma.wge(us$hospitalizedCurrently, phi = est.us.diff.seasAIC$phi, theta = est.us.diff.seasAIC$theta, d= 1, s = 7, n.ahead = 90, lastn = F, limits = F)

  • Windowed ASE: 18,103,000,000
phis = est.us.diff.seasAIC$phi
thetas = est.us.diff.seasAIC$theta

trainingSize = 24
horizon = 90
ASEHolder = numeric()

for( i in 1:(124-(trainingSize + horizon) + 1))
{
  forecasts = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize-1))],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = horizon)
  
  ASE = mean((us$hospitalizedCurrently[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
         
  ASEHolder[i] = ASE
}
invisible(ASEHolder)
hist(ASEHolder)

WindowedASE = mean(ASEHolder)

summary(ASEHolder)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.083e+08 6.202e+08 8.410e+09 1.810e+10 2.989e+10 6.151e+10
WindowedASE
## [1] 1.8103e+10
  • ASE: 108,278,159
fs = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize+horizon)-1)],phi = phis, theta = thetas, s = 7, d = 1, n.ahead = 90, lastn = T)

ASE = mean((us$hospitalizedCurrently[(i+trainingSize):(i+(trainingSize+horizon)-1)] - fs$f )^2)
ASE
## [1] 108278159

Univariate ARIMA(2,1,0), s=7 Forecasting

  • 7-Day Forecast
shortAR <- fore.aruma.wge(us$hospitalizedCurrently, phi = est.us.diff.seasBIC$phi, d=1, s=7, n.ahead = 7, lastn = FALSE, limits = FALSE)

  • AIC
est.us.diff.seasBIC$aic
## [1] 14.61952
  • Windowed ASE: 15,546,758
phis = est.us.diff.seasBIC$phi
thetas = est.us.diff.seasBIC$theta

trainingSize = 24
horizon = 7
ASEHolder = numeric()

for( i in 1:(124-(trainingSize + horizon) + 1))
{
  forecasts = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize-1))],phi = phis, theta = thetas, s = 7, d = 1, n.ahead = horizon)
  
  ASE = mean((us$hospitalizedCurrently[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
         
  ASEHolder[i] = ASE
}
invisible(ASEHolder)
hist(ASEHolder)

WindowedASE = mean(ASEHolder)

summary(ASEHolder)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     77157    618880   2264284  15546758   6837785 202401118
WindowedASE
## [1] 15546758
  • ASE: 60,809,514
fs = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize+horizon)-1)],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = 7, lastn = TRUE)

ASE = mean((us$hospitalizedCurrently[(i+trainingSize):(i+(trainingSize+horizon)-1)] - fs$f )^2)
ASE
## [1] 60809514
  • 90-Day Forecast
longAR <- fore.aruma.wge(us$hospitalizedCurrently, phi = est.us.diff.seasBIC$phi, s= 7, d = 1, n.ahead = 90, lastn = FALSE, limits = FALSE)

  • Windowed ASE: 19,427,666,679
phis = est.us.diff.seasBIC$phi
thetas = est.us.diff.seasBIC$theta

trainingSize = 24
horizon = 90
ASEHolder = numeric()

for( i in 1:(124-(trainingSize + horizon) + 1))
{
  forecasts = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize-1))],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = horizon)
  
  ASE = mean((us$hospitalizedCurrently[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
         
  ASEHolder[i] = ASE
}
ASEHolder
##  [1] 61052232487 54487603507 35485946674 27887241909 17356997661
##  [6]  7212077636  8960477292   598137404   130355537   347397993
## [11]   185865366
hist(ASEHolder)

WindowedASE = mean(ASEHolder)

summary(ASEHolder)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.304e+08 4.728e+08 8.960e+09 1.943e+10 3.169e+10 6.105e+10
WindowedASE
## [1] 19427666679
  • ASE: 185,865,366
fs = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize+horizon)-1)],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = 90, lastn = TRUE)

ASE = mean((us$hospitalizedCurrently[(i+trainingSize):(i+(trainingSize+horizon)-1)] - fs$f )^2)
ASE
## [1] 185865366

Univariate Multilayered Perceptron (MLP) / Neural Network Model

For our univariate NN model we created a training and test data set, which allows us to cross-validate model performance. This first NN model uses mostly default parameters, to see how the mlp function performs with few settings. However, with such little data we are also curious how leveraging all of the data changes the forecast trend, so we will model both side by side and compare the resulting forecasts.

  1. Creating train / test data set
library(nnfor)
head(us)
##           date positive negative pending hospitalizedCurrently
## 132 2020-03-17    11928    63104    1687                   325
## 131 2020-03-18    15099    84997    2526                   416
## 130 2020-03-19    19770   108407    3016                   617
## 129 2020-03-20    26025   138814    3330                  1042
## 128 2020-03-21    32910   177262    3468                  1436
## 127 2020-03-22    42169   213476    2842                  2155
##     hospitalizedCumulative inIcuCurrently inIcuCumulative
## 132                     55              0               0
## 131                     67              0               0
## 130                     85              0               0
## 129                    108              0               0
## 128                   2020              0               0
## 127                   3023              0               0
##     onVentilatorCurrently onVentilatorCumulative recovered death
## 132                     0                      0         0   122
## 131                     0                      0         0   153
## 130                     0                      0         0   199
## 129                     0                      0         0   267
## 128                     0                      0         0   328
## 127                     0                      0         0   471
##     totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 132            75032            22                   13            13707
## 131           100096            31                   12            21893
## 130           128177            46                   18            23410
## 129           164839            68                   23            30407
## 128           210172            61                 1912            38448
## 127           255645           143                 1003            36214
##     positiveIncrease
## 132             3613
## 131             3171
## 130             4671
## 129             6255
## 128             6885
## 127             9259
usTrain.nn = ts(us$hospitalizedCurrently[1:125])
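The holdout above can be sketched as a simple index-based split: 132 daily observations, the first 125 for training and the last 7 reserved for validation (toy values used for illustration):

```python
# stand-in values for us$hospitalizedCurrently (132 daily observations)
hosp = list(range(132))

# first 125 days train the network; the final 7 days are held out
train, test = hosp[:125], hosp[125:]
print(len(train), len(test))  # 125 7
```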
  1. Fitting NN model
  1. Fitted model on the train data set. While we will continue to build the model on the train data set to get our ASE and related metrics, we also wanted to see just how much of a difference merely 7 days of data can make to a model.
us.nn.fit = mlp(usTrain.nn, outplot = T, comb = "mean", m=7, reps = 50)

plot(us.nn.fit)

  1. Fitted a model on the full data set. The same model structure is developed, but we want to see how the additional data affects our forecast; we suspect a single data point up or down can drastically change the trend of the forecast.
us.nn.fit2 = mlp(ts(us$hospitalizedCurrently), outplot = T, comb = "mean", m=7, reps = 50)

plot(us.nn.fit2)

  1. Forecast horizon/step forward
  1. With just the trained data set we see a slight upward trend in the 7-day forecast. The 90-day forecast, however, shows a much larger lift toward the end of the horizon, and the plausible range of outcomes is also wide. The NN models give us a glimpse into how difficult it is to forecast something like COVID hospitalization counts: with such limited data, and no way to know whether we have observed a full 'cycle', many outcomes remain possible, and the NN forecast shows this through the wide range of forecast paths with the mean in blue at the center.
  • 7-Day Forecast
us.nn.fit.fore7 = forecast(us.nn.fit, h=7)
plot(us.nn.fit.fore7)

  • 90-Day Forecast
us.nn.fit.fore90 = forecast(us.nn.fit, h=90)
plot(us.nn.fit.fore90)

  1. For the 7-day forecast we still see an extensive range across the NN repetitions, but we do see a change in trend for the full data set relative to the trained one: those extra days produce a downward-trending projection instead of a flat level of COVID hospitalized patients. Practically, that is the difference between being able to reduce supplies and needing supplies to remain the same, so a daily update of the model would be needed to accurately forecast any future trend. For the 90-day forecast we see a similar change: the forecast based on the trained data set expects a large increase in trend, which runs directly opposite to what we expect from a novel virus.
  • 7 Day Forecast
us.nn.fit.fore2 = forecast(us.nn.fit2, h=7)
plot(us.nn.fit.fore2)

  • 90-Day Forecast
us.nn.fit.fore2 = forecast(us.nn.fit2, h=90)
plot(us.nn.fit.fore2)

  1. Plot forecast against test set
plot(us$hospitalizedCurrently[126:132], type = "l", ylim = c(55000, 80000))
lines(seq(1:7), us.nn.fit.fore7$mean, col = "blue")

  1. ASE
  • 7-Day: 1,777,856
ASEus.nn.fit.fore7 = mean((us$hospitalizedCurrently[126:132]-us.nn.fit.fore7$mean)^2)
ASEus.nn.fit.fore7
## [1] 1777856

  • 90-Day: 427,549,097

ASEus.nn.fit.fore90 = mean((us$hospitalizedCurrently[43:132]-us.nn.fit.fore90$mean)^2)
ASEus.nn.fit.fore90
## [1] 427549097

MLP NN Model Analysis

We completed a default neural network model. With so many ways to tune a neural network, we knew this would not be our best model. So we moved forward with a hyper-tuned neural network model for our ensemble, which allows us to calculate many windowed ASEs and compare the models against each other.

Ensemble Model / Hyper tuned NN Model

library(tswgewrapped)
  1. Train / Test Data Sets
set.seed(3)
data_train.u <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[1:122], positiveIncrease = rnorm(122, 0, .0001))
data_test.u <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[123:132], positiveIncrease = rnorm(10, 0, .0001))
  1. Hyper tune parameters
  • Here we run a specialty function from the tswgewrapped package that performs a grid search, tuning all the parameters to obtain the combination with the lowest windowed ASE.
# search for best NN hyperparameters in given grid
model.u = tswgewrapped::ModelBuildNNforCaret$new(data = data_train.u, var_interest = "hospitalizedCurrently",
                                               search = 'random', tuneLength = 5, parallel = TRUE,
                                               batch_size = 50, h = 7, m = 7,
                                               verbose = 1)
## Registered S3 method overwritten by 'lava':
##   method     from   
##   print.pcor greybox
## Aggregating results
## Selecting tuning parameters
## Fitting reps = 16, hd = 1, allow.det.season = FALSE on full training set
## - Total Time for training: : 46.573 sec elapsed
  1. The windowed ASEs associated with the grid of hyperparameters are shown in the table and heatmap below.
res.u <- model.u$summarize_hyperparam_results()
res.u
##   reps hd allow.det.season     RMSE      ASE   RMSESD    ASESD
## 1   10  2             TRUE 4873.573 46711465 5025.507 79166325
## 2   11  3            FALSE 3840.162 31663386 4313.721 69744266
## 3   16  1            FALSE 2953.278 11858347 1857.457 10343460
## 4   16  3             TRUE 3991.706 33375228 4380.144 70312087
## 5   22  3            FALSE 3957.298 32780242 4339.590 70283659
model.u$plot_hyperparam_results()

  1. The best hyperparameters based on this grid search are listed in the table below.
best.u <- model.u$summarize_best_hyperparams()
  1. Windowed ASE is below.
final.ase.u <- dplyr::filter(res.u, reps == best.u$reps &
                    hd == best.u$hd &
                    allow.det.season == best.u$allow.det.season)[['ASE']]
final.ase.u
## [1] 11858347
  1. Ensemble model characteristics and plot
# Ensemble / Hypertuned NN Model
caret_model.u = model.u$get_final_models(subset = 'a')
caret_model.u$finalModel
## MLP fit with 1 hidden node and 16 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## 1 regressor included.
## - Regressor 1 lags: (4)
## Forecast combined using the median operator.
## MSE: 1427611.8956.
#Plot Final Model
plot(caret_model.u$finalModel)

  1. Train ensemble model
  • Since we came across an interesting outcome with our default NN model above when we compared the forecasts on the trained vs. full data set, we will do the same with our hyper-tuned ensemble model.
  1. First we build our ensemble model on the trained data set.
#Ensemble model trained data
ensemble.mlp.u1 = nnfor::mlp(usTrain.nn, outplot = T, reps = best.u$reps, hd = best.u$hd, allow.det.season = F)

ensemble.mlp.u1
## MLP fit with 1 hidden node and 16 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## Forecast combined using the median operator.
## MSE: 1452364.1472.
  1. Next we build our model on the entire data set.
#Ensemble model
ensemble.mlp.u2 = nnfor::mlp(ts(us$hospitalizedCurrently), outplot = T, reps = best.u$reps, hd = best.u$hd, allow.det.season = F)

ensemble.mlp.u2
## MLP fit with 1 hidden node and 16 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## Forecast combined using the median operator.
## MSE: 1386442.1448.
  • 7-Day
  1. First we produce a 7-day forecast from the trained data set. We see a flattening of the forecast over the next 7 days with a slight downward trend.
fore7.1.u = forecast(ensemble.mlp.u1 , h=7)
#grabbing prediction intervals for 7 day forecast
all.mean7 <- data.frame(fore7.1.u$all.mean)
ranges7 <- data.frame(apply(all.mean7, MARGIN = 1, FUN = range))
subtracts7 <- ranges7 - as.list(ranges7[1,])
nintyperc7 <- data.frame(mapply(`*`,subtracts7,.9,SIMPLIFY=FALSE))
diffs7 <- data.frame(mapply(`/`,nintyperc7,2,SIMPLIFY = FALSE))
diffs7 = diffs7[-1,]
vector7 <-  as.numeric(diffs7[1,])

plot(fore7.1.u)
lines(seq(126,132,1), (fore7.1.u$mean + vector7), type = "l", col = "red")
lines(seq(126,132,1), (fore7.1.u$mean - vector7), type = "l", col = "red")
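Since mlp() with multiple reps returns one forecast path per repetition (the all.mean matrix used above, one column per rep), an empirical band can also be read directly from percentiles across the repetitions, as an alternative to the range-based spread computed above. A sketch with toy data:

```python
import numpy as np

# toy stand-in for fore$all.mean: 7 forecast horizons x 16 repetitions
rng = np.random.default_rng(0)
all_mean = rng.normal(loc=60000, scale=500, size=(7, 16))

# empirical 90% band: 5th and 95th percentiles across the repetitions
lower = np.percentile(all_mean, 5, axis=1)
upper = np.percentile(all_mean, 95, axis=1)
center = np.median(all_mean, axis=1)

print(bool((lower <= center).all() and (center <= upper).all()))  # True
```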

  1. Next we produced the 7-day forecast for our full data set. We see a much stronger downward trend in this 7-day forecast.
fore7.2.u = forecast(ensemble.mlp.u2, h=7)
plot(fore7.2.u)

  • 90-Day
  1. With our 90-day forecast from the trained data set we see a very strong downward trend, with the possibility of a slightly less steep variation (shown by the shadowed line slightly above the highlighted blue mean).
fore90.1.u = forecast(ensemble.mlp.u1, h=90)
#grabbing prediction intervals for 90 day forecast
all.mean90 <- data.frame(fore90.1.u$all.mean)
ranges90 <- data.frame(apply(all.mean90, MARGIN = 1, FUN = range))
subtracts90 <- ranges90 - as.list(ranges90[1,])
nintyperc90 <- data.frame(mapply(`*`,subtracts90,.9,SIMPLIFY=FALSE))
diffs90 <- data.frame(mapply(`/`,nintyperc90,2,SIMPLIFY = FALSE))
diffs90 = diffs90[-1,]
vector90 <-  as.numeric(diffs90[1,])

plot(fore90.1.u)
lines(seq(126,215,1), (fore90.1.u$mean + vector90), type = "l", col = "red")
lines(seq(126,215,1), (fore90.1.u$mean - vector90), type = "l", col = "red")

  1. Full data set. We see a clear downward trend suggesting COVID hospitalized cases would eventually reach zero. The forecast paths all stay close to the mean, and there does not seem to be much chance of deviation from this model.
fore90.2.u = forecast(ensemble.mlp.u2, h=90)
plot(fore90.2.u)

Univariate Model Analysis

Upon completion of the above models, the most important takeaway is that each data point is essential in determining the trend of the COVID virus. With cross-validation methods we can see a trend, but as each new data point becomes part of the model, the trend alters day by day. Moving forward, it will be essential that models are updated daily to capture a reliable trend and therefore to forecast hospital needs and the severity of COVID.

When investigating these models, it became clear that the 90-day forecasts were simply repeating the trend and seasonality, without the kind of extrapolation we would trust for long-term forecasting. We would only recommend the short 7-day forecasts for predicting hospital equipment and staffing needs. The ensemble model had the lowest windowed ASE and is what we recommend moving forward for these short-term forecasts.

Univariate Model Performance Breakdown

ARIMA(5,1,1), s=7 Forecasting

  • 7-Day Windowed ASE
    • 14,880,281
  • 90-Day Windowed ASE
    • 18,103,000,000

ARIMA(2,1,0), s=7 Forecasting

  • 7-Day Windowed ASE
    • 15,546,758
  • 90-Day Windowed ASE
    • 19,427,666,679

Default Neural Network Model

  • 7-Day ASE
    • 1,777,856
  • 90-Day ASE
    • 427,549,097

Ensemble Model: Hyper Tuned Neural Network Model

  • 7-Day Windowed ASE
    • 11,858,347

Goal 3: US COVID Data

Data Prep

To prep our data following our EDA, we load the data, convert the date, drop unneeded columns, replace missing values with zero for the variable we are testing (hospitalizedCurrently), and sort from oldest to newest. This allows us to easily work through our multivariate time series evaluations.

us <- read.csv("https://raw.githubusercontent.com/JaclynCoate/6373_Time_Series/master/TermProject/Data/USdaily.7.26.20.csv", header = T, strip.white = T)
us <- transform(us, date = as.Date(as.character(date), "%Y%m%d"))
us <- subset(us, select = -c(states, dateChecked, hospitalized, lastModified, total, posNeg, totalTestResultsIncrease, hash))
us[is.na(us)] <- 0
us = us[order(as.Date(us$date, format = "%Y%m%d")),]
head(us)
##           date positive negative pending hospitalizedCurrently
## 187 2020-01-22        2        0       0                     0
## 186 2020-01-23        2        0       0                     0
## 185 2020-01-24        2        0       0                     0
## 184 2020-01-25        2        0       0                     0
## 183 2020-01-26        2        0       0                     0
## 182 2020-01-27        2        0       0                     0
##     hospitalizedCumulative inIcuCurrently inIcuCumulative
## 187                      0              0               0
## 186                      0              0               0
## 185                      0              0               0
## 184                      0              0               0
## 183                      0              0               0
## 182                      0              0               0
##     onVentilatorCurrently onVentilatorCumulative recovered death
## 187                     0                      0         0     0
## 186                     0                      0         0     0
## 185                     0                      0         0     0
## 184                     0                      0         0     0
## 183                     0                      0         0     0
## 182                     0                      0         0     0
##     totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 187                2             0                    0                0
## 186                2             0                    0                0
## 185                2             0                    0                0
## 184                2             0                    0                0
## 183                2             0                    0                0
## 182                2             0                    0                0
##     positiveIncrease
## 187                0
## 186                0
## 185                0
## 184                0
## 183                0
## 182                0

Stationarity: Current Hospitalizations

It is difficult to assume stationarity for this data for multiple reasons. We are working under the assumption that COVID is a novel virus and that cases, as well as hospitalizations, will eventually return to zero, whereas our current modeling techniques tend to revert to the mean or mimic previously seen trends. We also see a heavy wandering trend in new cases, and hospitalizations depend on new cases as well as on time. We will review the data, see what non-stationary components (if any) reveal themselves, and model the data accordingly.
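The wandering behavior can also be checked with a formal unit-root test. As a supplementary sanity check that is not part of our workflow above, an augmented Dickey-Fuller test from the `tseries` package would look like the following; a large p-value is consistent with the non-stationarity we describe:

```r
# Illustrative only: ADF unit-root test on current hospitalizations.
# Requires the tseries package to be installed.
library(tseries)
adf.test(us$hospitalizedCurrently)
```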

Current Hospitalization Realization

Traits:

  • Heavy wandering behavior
  • Some apparent noise that could be pseudo-cyclic behavior masked by the large magnitude of the series.
ggplot(data = us, aes(x=date, y=hospitalizedCurrently))+
  geom_line(color="orange")+
  labs(title = "Current COVID Hospitalized Cases US", y = "Thousands", x = "") +
    theme_fivethirtyeight()

Independent Variable Review

Reviewing the variables and the scatter plots from our previous EDA, we can see some expected correlations; for example, hospitalizedCurrently is correlated with the variables that reflect ICU and ventilator patients. These metrics wouldn’t exist without hospitalizations, but they also cannot help us predict hospitalizations because they occur afterward. So we will instead leverage variables such as positiveIncrease to see whether there is a correlation between hospitalized patients and the number of positive cases.

  • Positive Increase Trend
ggplot(data = us, aes(x=date, y=positiveIncrease))+
  geom_line(color="orange")+
  labs(title = "Increase in COVID Cases US", y = "Thousands", x = "") +
    theme_fivethirtyeight()
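The contemporaneous relationship between new cases and hospitalizations can also be quantified with a quick base-R correlation; this is a supplementary check, not part of the original scripts:

```r
# Correlation between daily new cases and current hospitalizations.
cor(us$positiveIncrease, us$hospitalizedCurrently)
```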

Multivariate MLR w/ Correlated Errors for Currently Hospitalized Patients

Forecast Independent Variables: Increase Positive Cases

  1. Evaluation: a slowly damping ACF and heavy wandering consistent with a (1-B) factor, plus spectral peaks at 0 and roughly 0.14, consistent with a (1-B) factor and a seasonal component of 7.
#a
plotts.sample.wge(us$positiveIncrease)

## $autplt
##  [1] 1.0000000 0.9740745 0.9439116 0.9109437 0.8869924 0.8711719 0.8605261
##  [8] 0.8462935 0.8144161 0.7772852 0.7356766 0.7045767 0.6829609 0.6646632
## [15] 0.6432819 0.6088202 0.5688872 0.5293417 0.5001353 0.4761589 0.4592724
## [22] 0.4426271 0.4148783 0.3789333 0.3458221 0.3182895
## 
## $freq
##  [1] 0.005347594 0.010695187 0.016042781 0.021390374 0.026737968
##  [6] 0.032085561 0.037433155 0.042780749 0.048128342 0.053475936
## [11] 0.058823529 0.064171123 0.069518717 0.074866310 0.080213904
## [16] 0.085561497 0.090909091 0.096256684 0.101604278 0.106951872
## [21] 0.112299465 0.117647059 0.122994652 0.128342246 0.133689840
## [26] 0.139037433 0.144385027 0.149732620 0.155080214 0.160427807
## [31] 0.165775401 0.171122995 0.176470588 0.181818182 0.187165775
## [36] 0.192513369 0.197860963 0.203208556 0.208556150 0.213903743
## [41] 0.219251337 0.224598930 0.229946524 0.235294118 0.240641711
## [46] 0.245989305 0.251336898 0.256684492 0.262032086 0.267379679
## [51] 0.272727273 0.278074866 0.283422460 0.288770053 0.294117647
## [56] 0.299465241 0.304812834 0.310160428 0.315508021 0.320855615
## [61] 0.326203209 0.331550802 0.336898396 0.342245989 0.347593583
## [66] 0.352941176 0.358288770 0.363636364 0.368983957 0.374331551
## [71] 0.379679144 0.385026738 0.390374332 0.395721925 0.401069519
## [76] 0.406417112 0.411764706 0.417112299 0.422459893 0.427807487
## [81] 0.433155080 0.438502674 0.443850267 0.449197861 0.454545455
## [86] 0.459893048 0.465240642 0.470588235 0.475935829 0.481283422
## [91] 0.486631016 0.491978610 0.497326203
## 
## $db
##  [1]  14.3997611  15.8955909   7.7642205   8.7050186   3.8198773
##  [6]  -1.1863576   2.6815084  -0.4667544  -1.0582274  -3.9729372
## [11]  -4.4388886  -5.3739230  -7.0574666  -3.9001118  -5.3732547
## [16]  -5.9203161  -6.0265835  -6.3073466  -8.7427682  -8.5315257
## [21]  -8.1694048  -8.8726978  -6.2228751  -5.6255282  -5.7869081
## [26]  -2.3190255  -0.6895940 -18.2510780 -12.7252753 -17.9895004
## [31] -13.1495229 -17.1809062 -12.9879824 -18.5885318 -14.0950716
## [36] -19.2823521 -16.1802599 -22.2193540 -17.1778910 -15.9864725
## [41] -13.8720977 -14.3606703 -12.3482825 -19.0159734 -12.8299009
## [46] -16.1634117 -19.2402721 -16.9940391 -19.9193164 -24.4462952
## [51] -12.2184745 -14.1347405 -11.1407485 -19.5608640 -22.3406178
## [56] -26.2948840 -15.9219942 -19.7896219 -16.3429699 -16.6545137
## [61] -13.2991307 -20.9465604 -12.5392646 -23.3015328 -21.5282804
## [66] -17.4748203 -26.6390880 -24.5923917 -17.5647166 -19.7282494
## [71] -17.6845420 -23.0825883 -19.0743556 -19.4958696 -19.5138333
## [76] -20.4270518 -18.7343253 -16.8982195 -14.1306252 -14.2054286
## [81] -15.0466299 -13.4474119 -14.7662082 -15.2848488 -14.7767269
## [86] -17.2543647 -17.5524024 -16.7680943 -23.6931393 -19.6888637
## [91] -22.9328710 -17.1805763 -15.7422487
## 
## $dbz
##  [1]  12.1744244  11.8068871  11.1912920  10.3234897   9.1987290
##  [6]   7.8134517   6.1690554   4.2794940   2.1855400  -0.0227779
## [11]  -2.1856123  -4.0930116  -5.5814407  -6.6274820  -7.3078903
## [16]  -7.6988921  -7.8526856  -7.8271468  -7.6919809  -7.5061868
## [21]  -7.3030888  -7.0977786  -6.9055053  -6.7545171  -6.6859448
## [26]  -6.7448940  -6.9710578  -7.3933729  -8.0283548  -8.8797888
## [31]  -9.9376266 -11.1745795 -12.5395304 -13.9486256 -15.2804168
## [36] -16.3913300 -17.1665125 -17.5812779 -17.7097183 -17.6695854
## [41] -17.5637030 -17.4562894 -17.3753921 -17.3239844 -17.2913960
## [46] -17.2628022 -17.2262162 -17.1767997 -17.1186074 -17.0640034
## [51] -17.0309217 -17.0382729 -17.1002705 -17.2209453 -17.3905128
## [56] -17.5857017 -17.7761429 -17.9367144 -18.0606665 -18.1649087
## [61] -18.2832201 -18.4521578 -18.6979308 -19.0286543 -19.4319317
## [66] -19.8765383 -20.3176913 -20.7053075 -20.9929698 -21.1442791
## [71] -21.1356798 -20.9585908 -20.6228760 -20.1586258 -19.6118560
## [76] -19.0349178 -18.4769322 -17.9782467 -17.5690679 -17.2703544
## [81] -17.0951532 -17.0493949 -17.1318377 -17.3332302 -17.6350503
## [86] -18.0085946 -18.4157858 -18.8133902 -19.1613062 -19.4324915
## [91] -19.6189508 -19.7292633 -19.7788462
  1. Differencing Data: the differenced data is much more stationary and surfaces a seasonal component of 7, seen in the ACF peaks at lags 7, 14, and 21.
#b
inpos.diff = artrans.wge(us$positiveIncrease, phi.tr = 1, lag.max = 100)

  1. Seasonality Transformation: stationary data, and an ACF that reflects data other than white noise to be modeled.
#c
inpos.diff.seas = artrans.wge(inpos.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)

  1. Model IDing: diagnose with aic5.wge to determine the p and q orders: both AIC and BIC select ARMA(4,2)
#d
aic5.wge(inpos.diff.seas)
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 15    4    2   15.62368
## 9     2    2   15.74701
## 18    5    2   15.74771
## 12    3    2   15.75351
## 7     2    0   15.75551
aic5.wge(inpos.diff.seas, type = "bic")
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  bic
##       p    q        bic
## 15    4    2   15.74832
## 2     0    1   15.79201
## 7     2    0   15.80893
## 4     1    0   15.81457
## 3     0    2   15.81522
  1. White Noise Test: the Ljung-Box p-value is well below 0.05, so the transformed series is not white noise and there is remaining structure to model.
#e
acf(inpos.diff.seas)

ljung.wge(inpos.diff.seas)$pval
## Obs -0.3711871 -0.02127283 0.07854862 0.05194601 -0.1044838 0.3276797 -0.4658847 0.1565306 0.1284186 -0.1060791 -0.07693327 0.1030723 -0.06391648 0.05002847 0.004471288 -0.03640442 -0.01551996 0.04133725 -0.01110058 -0.08153895 0.06570572 -0.03709739 -0.02390062 0.04588417
## [1] 1.246447e-12
  1. Estimate Phis, Thetas
  2. Model Building
#f
est.inpos.seas = est.arma.wge(inpos.diff.seas, p = 4, q = 2)
## 
## Coefficients of Original polynomial:  
## -1.8727 -1.4974 -0.4735 0.0283 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1+1.1293B+0.6983B^2   -0.8086+-0.8822i      0.8357       0.3681
## 1+0.7944B             -1.2588               0.7944       0.5000
## 1-0.0510B              19.6244               0.0510       0.0000
##   
## 
mean(us$positiveIncrease)
## [1] 22567.12
#g
  1. Forecast
  • 7 Day
#7 day
inpos.preds7 = fore.aruma.wge(us$positiveIncrease, phi = est.inpos.seas$phi, theta = est.inpos.seas$theta, d=1, s=7, n.ahead = 7)

  • 90 Day
#90 day
inpos.preds90 = fore.aruma.wge(us$positiveIncrease, phi = est.inpos.seas$phi, theta = est.inpos.seas$theta, d=1, s=7, n.ahead = 90)

  1. Plotting forecasts
  • 7 Day Forecast
#7 day forecast
plot(seq(1,187,1), us$positiveIncrease, type = "l", xlim = c(0,195), ylim = c(0,80000), ylab = "Increase in COVID Cases", main = "7 Day Increase in COVID Cases Forecast")
lines(seq(188, 194,1), inpos.preds7$f, type = "l", col = "red")

  • 90 Day Forecast
#90 day forecast
plot(seq(1,187,1), us$positiveIncrease, type = "l", xlim = c(0,280), ylim = c(0,80000), ylab = "Increase in COVID Cases", main = "90 Day Increase in COVID Cases Forecast")
lines(seq(188, 277,1), inpos.preds90$f, type = "l", col = "red")

Forecast Dependent Variable: MLR w/ Correlated Errors for Currently Hospitalized Patients using Positive Increase

  1. Data prep and recognizing differencing and seasonal components of Currently Hospitalized
#Selecting only those dates with reported current hospitalizations
us <- dplyr::slice(us,56:187)
invisible(us)
#Stationarize Currently Hospitalized Variable
us.diff = artrans.wge(us$hospitalizedCurrently, phi.tr = 1, lag.max = 100)

acf(us.diff)
us.diff.seas = artrans.wge(us.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)

acf(us.diff.seas)
#Stationarize Increase Positive Variable
inpos.diff = artrans.wge(us$positiveIncrease, phi.tr = 1, lag.max = 100)

acf(inpos.diff)
inpos.diff.seas = artrans.wge(inpos.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)

acf(inpos.diff.seas)

  1. Fit simple regression model predicting hospitalizedCurrently with positiveIncrease.
mlr.fit = lm(us.diff.seas~inpos.diff.seas, data=cbind(data.frame(us.diff.seas), data.frame(inpos.diff.seas)))
summary(mlr.fit)
## 
## Call:
## lm(formula = us.diff.seas ~ inpos.diff.seas, data = cbind(data.frame(us.diff.seas), 
##     data.frame(inpos.diff.seas)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7149.3  -554.2    73.4   611.2  6506.7 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     -13.32614  143.23299  -0.093  0.92603   
## inpos.diff.seas   0.11733    0.04227   2.776  0.00637 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1595 on 122 degrees of freedom
## Multiple R-squared:  0.0594, Adjusted R-squared:  0.0517 
## F-statistic: 7.705 on 1 and 122 DF,  p-value: 0.006375
AIC(mlr.fit)
## [1] 2184.712
acf(mlr.fit$residuals)

plot(mlr.fit$residuals)

  1. Diagnose the residuals from the above model with aic.wge to select p and q
  • Below we can see that the aic.wge function has selected an ARMA(5,2) model for the residuals of our currently hospitalized series.
mlr.phis= aic.wge(mlr.fit$residuals)
mlr.phis
## $type
## [1] "aic"
## 
## $value
## [1] 14.53403
## 
## $p
## [1] 5
## 
## $q
## [1] 2
## 
## $phi
## [1] -0.2465069 -0.4219100  0.3119255  0.2116968  0.2024803
## 
## $theta
## [1] -0.4657537 -0.9566014
## 
## $vara
## [1] 1803071
  1. Now forecast with the arima function, using the ARMA(5,2) orders selected above for the residual model.
mlr.fit.arima = arima(us$hospitalizedCurrently, order = c(5,1,2), seasonal = list(order = c(1,0,0), period = 7), xreg = us$positiveIncrease)
## Warning in log(s2): NaNs produced
#AIC(mlr.fit.arima)
  1. Run a test to confirm the residuals are white noise: white noise is confirmed, so we continue forward
acf(mlr.fit.arima$resid) 

ltest1 = ljung.wge(mlr.fit.arima$resid) 
## Obs -0.006649913 -0.009334226 -0.01070918 0.01827832 -0.02771128 -0.01061308 -0.01123975 0.08781402 -0.05809279 0.07500195 0.05661161 -0.08527475 -0.007553036 0.1098068 0.002427647 -0.010962 -0.04242877 -0.001526285 -0.05612271 -0.03313627 0.02930685 -0.007307643 -0.06056842 -0.03342384
ltest1$pval
## [1] 0.999208
ltest2 = ljung.wge(mlr.fit.arima$resid, K= 48)
## Obs -0.006649913 -0.009334226 -0.01070918 0.01827832 -0.02771128 -0.01061308 -0.01123975 0.08781402 -0.05809279 0.07500195 0.05661161 -0.08527475 -0.007553036 0.1098068 0.002427647 -0.010962 -0.04242877 -0.001526285 -0.05612271 -0.03313627 0.02930685 -0.007307643 -0.06056842 -0.03342384 0.08407797 -0.006875273 -0.002312484 0.06777622 -0.09591675 -0.06198868 -0.01331578 -0.03441461 -0.04091228 0.0270524 0.06615092 0.06538511 -0.09261011 -0.04939741 -0.03187124 -0.04934637 -0.01834029 -0.02031817 -0.001873916 0.02337696 -0.004828854 0.006148926 -0.002455308 0.004810674
ltest2$pval
## [1] 0.9999866
  1. Load the forecasted positive increase into a data frame
  • 7 Day
#7 Day Case Increase
regs7 = data.frame(positiveIncrease = inpos.preds7$f)
invisible(regs7)
  • 90 Day
#90 Day Case Increase
regs90 = data.frame(positiveIncrease = inpos.preds90$f)
invisible(regs90)
  1. Predictions
  • 7 Day
mlr1.preds7 = predict(mlr.fit.arima, newxreg = regs7, n.ahead = 7)
invisible(mlr1.preds7)
  • 90 Day
mlr1.preds90 = predict(mlr.fit.arima, newxreg = regs90, n.ahead =90)
invisible(mlr1.preds90)
  1. Plotted Forecasts
  • 7 Day
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l",  xlim = c(0,140), ylim = c(0,60000), ylab = "Currently Hospitalized COVID Cases", main = "7 Day Forecast for COVID Hospitalized Cases")
lines(seq(133,139,1), mlr1.preds7$pred, type = "l", col = "red")

  • 90 Day
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l",  xlim = c(0,223), ylim = c(0,60000), ylab = "Currently Hospitalized COVID Cases", main = "90 Day Forecast for COVID Hospitalized Cases")
lines(seq(133,222,1), mlr1.preds90$pred, type = "l", col = "red")

  1. ASE
  • 7-Day: 631,469.8
mlr.train7 <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[1:125], positiveIncrease = us$positiveIncrease[1:125])
mlr.test7 <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[126:132], positiveIncrease = us$positiveIncrease[126:132])

fit7 = arima(mlr.train7$hospitalizedCurrently, order = c(5,1,2), seasonal = list(order = c(1,0,0), period = 7), xreg = mlr.train7$positiveIncrease)
fit7
## 
## Call:
## arima(x = mlr.train7$hospitalizedCurrently, order = c(5, 1, 2), seasonal = list(order = c(1, 
##     0, 0), period = 7), xreg = mlr.train7$positiveIncrease)
## 
## Coefficients:
##          ar1      ar2      ar3     ar4     ar5      ma1     ma2    sar1
##       1.5532  -0.9556  -0.0329  0.1479  0.1442  -1.3595  0.9998  0.1704
## s.e.  0.0933   0.1667   0.1881  0.1699  0.0981   0.0402  0.0478  0.1017
##       mlr.train7$positiveIncrease
##                            0.0977
## s.e.                       0.0307
## 
## sigma^2 estimated as 1188841:  log likelihood = -1045.66,  aic = 2111.32
next7 = data.frame(positiveIncrease = mlr.test7$positiveIncrease)
next7
##   positiveIncrease
## 1            56971
## 2            63642
## 3            69150
## 4            71027
## 5            75193
## 6            65413
## 7            61713
mlr.7.preds = predict(fit7, newxreg = next7, n.ahead = 7)
mlr.7.preds
## $pred
## Time Series:
## Start = 126 
## End = 132 
## Frequency = 1 
## [1] 57339.81 58363.17 59186.53 59836.05 60501.54 59898.44 59601.92
## 
## $se
## Time Series:
## Start = 126 
## End = 132 
## Frequency = 1 
## [1] 1097.598 1716.760 2415.802 3162.212 3946.967 4790.285 5669.052
ASEmlr7 = mean((mlr.test7$hospitalizedCurrently - mlr.7.preds$pred)^2)
ASEmlr7
## [1] 631469.8
  • 90-Day: 567,209,810
  • It would not be a statistically sound decision to compute an ASE for a 90-day forecast: the model would be trained on 20% of the data and tested on 80%, the opposite of a sound time series split. For our COVID severity predictions we advise producing forecasts and ASEs only at the 7-day horizon, but we have computed the 90-day ASE for reference below.
mlr.train90 <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[1:42], positiveIncrease = us$positiveIncrease[1:42])
mlr.test90 <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[43:132], positiveIncrease = us$positiveIncrease[43:132])

fit90 = arima(mlr.train90$hospitalizedCurrently, order = c(5,1,2), seasonal = list(order = c(1,0,0), period = 7), xreg = mlr.train90$positiveIncrease)
fit90
## 
## Call:
## arima(x = mlr.train90$hospitalizedCurrently, order = c(5, 1, 2), seasonal = list(order = c(1, 
##     0, 0), period = 7), xreg = mlr.train90$positiveIncrease)
## 
## Coefficients:
##          ar1     ar2     ar3     ar4      ar5     ma1     ma2    sar1
##       -0.739  0.4956  0.8542  0.0232  -0.3749  1.0459  0.4434  0.6831
## s.e.   0.331  0.4000  0.2116  0.2977   0.2003  0.3550  0.4646  0.1296
##       mlr.train90$positiveIncrease
##                              0.121
## s.e.                         0.060
## 
## sigma^2 estimated as 1546410:  log likelihood = -352.66,  aic = 725.32
next90 = data.frame(positiveIncrease = mlr.test90$positiveIncrease)
invisible(next90)

mlr.90.preds = predict(fit90, newxreg = next90, n.ahead = 90)
invisible(mlr.90.preds)

ASEmlr90 = mean((us$hospitalizedCurrently[43:132]-mlr.90.preds$pred)^2)
ASEmlr90
## [1] 567209810

Forecast Dependent Variable: MLR w/ Correlated Errors for Currently Hospitalized Patients w/ Positive Increase & Trend

  1. Fit simple regression model predicting hospitalizedCurrently with positiveIncrease and trend
#creating trend
time <- seq(1,124,1)
#fitting model
mlr.fit.t = lm(us.diff.seas~inpos.diff.seas+time, data=cbind(data.frame(us.diff.seas), data.frame(inpos.diff.seas)))
summary(mlr.fit.t)
## 
## Call:
## lm(formula = us.diff.seas ~ inpos.diff.seas + time, data = cbind(data.frame(us.diff.seas), 
##     data.frame(inpos.diff.seas)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7111.0  -524.5    73.9   617.7  6540.6 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      32.35112  289.27171   0.112  0.91114   
## inpos.diff.seas   0.11724    0.04244   2.762  0.00663 **
## time             -0.73096    4.01658  -0.182  0.85590   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1601 on 121 degrees of freedom
## Multiple R-squared:  0.05966,    Adjusted R-squared:  0.04412 
## F-statistic: 3.839 on 2 and 121 DF,  p-value: 0.02419
AIC(mlr.fit.t)
## [1] 2186.678
acf(mlr.fit.t$residuals)

plot(mlr.fit.t$residuals)

  1. Diagnose the residuals from the above model with aic.wge to select p and q
  • Below we can see that our aic.wge function has selected an ARMA(5,2) model for modeling our currently hospitalized information.
mlr.phis= aic.wge(mlr.fit.t$residuals)
mlr.phis
## $type
## [1] "aic"
## 
## $value
## [1] 14.53329
## 
## $p
## [1] 5
## 
## $q
## [1] 2
## 
## $phi
## [1] -0.2466126 -0.4220817  0.3118538  0.2119376  0.2027949
## 
## $theta
## [1] -0.4655288 -0.9564592
## 
## $vara
## [1] 1801724
  1. Now forecast with the arima function, using the ARMA(5,2) orders selected above for the residual model.
Time <- seq(1,132,1)
mlr.fit.arima.t = arima(us$hospitalizedCurrently, order = c(5,1,2), seasonal = list(order = c(1,0,0), period = 7), xreg = cbind(us$positiveIncrease, Time))
#AIC(mlr.fit.arima.t)
  1. Run a test to confirm the residuals are white noise: white noise is confirmed, so we continue forward.
acf(mlr.fit.arima.t$resid) 

ltest1 = ljung.wge(mlr.fit.arima.t$resid) 
## Obs 0.002267493 0.004743329 0.004085712 0.008584111 0.003613306 0.03592544 -0.02552583 0.08439651 -0.1060931 0.01276058 0.02640347 -0.06061388 0.05268146 0.1458814 0.008052735 -0.04255608 -0.0935536 -0.04208266 -0.0498013 -0.001717023 0.05934226 0.01341049 -0.07137289 -0.07279666
ltest1$pval
## [1] 0.9823161
ltest2 = ljung.wge(mlr.fit.arima.t$resid, K= 48)
## Obs 0.002267493 0.004743329 0.004085712 0.008584111 0.003613306 0.03592544 -0.02552583 0.08439651 -0.1060931 0.01276058 0.02640347 -0.06061388 0.05268146 0.1458814 0.008052735 -0.04255608 -0.0935536 -0.04208266 -0.0498013 -0.001717023 0.05934226 0.01341049 -0.07137289 -0.07279666 0.05320601 -0.01313386 0.01794322 0.1027134 -0.07595104 -0.06199621 -0.03306343 -0.06206906 -0.03944973 0.03988941 0.07572305 0.07297186 -0.1029145 -0.06857141 -0.04403521 -0.04289784 0.005428483 -0.003871671 -7.833593e-05 0.005208028 -0.03085214 -0.01036108 0.00645461 0.02868683
ltest2$pval
## [1] 0.9990337
  1. Load the forecasted positive increase into a data frame
  • 7 Day
#7 Day Case Increase
regs7t = data.frame(cbind(positiveIncrease = inpos.preds7$f, Time = seq(133,139)))
invisible(regs7t)
  • 90 Day
#90 Day Case Increase
regs90t = data.frame(cbind(positiveIncrease = inpos.preds90$f, Time = seq(133,222)))
invisible(regs90t)
  1. Predictions
  • 7 Day
mlr1.preds7.t = predict(mlr.fit.arima.t, newxreg = regs7t, n.ahead = 7)
invisible(mlr1.preds7.t)
  • 90 Day
mlr1.preds90.t = predict(mlr.fit.arima.t, newxreg = regs90t, n.ahead =90)
invisible(mlr1.preds90.t)
  1. Plotted Forecasts
  • 7 Day
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l",  xlim = c(0,140), ylim = c(0,60000), ylab = "Currently Hospitalized COVID Cases", main = "7 Day Forecast for COVID Hospitalized Cases")
lines(seq(133,139,1), mlr1.preds7.t$pred, type = "l", col = "red")

  • 90 Day
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l",  xlim = c(0,223), ylim = c(0,85000), ylab = "Currently Hospitalized COVID Cases", main = "90 Day Forecast for COVID Hospitalized Cases")
lines(seq(133,222,1), mlr1.preds90.t$pred, type = "l", col = "red")

  1. ASE
  • 7-Day: 1,449,565
mlr.t.train7 <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[1:125], positiveIncrease = us$positiveIncrease[1:125])
mlr.t.test7 <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[126:132], positiveIncrease = us$positiveIncrease[126:132])

fit.t7 = arima(mlr.t.train7$hospitalizedCurrently, order = c(5,1,2), seasonal = list(order = c(1,0,0), period = 7), xreg = cbind(mlr.t.train7$positiveIncrease, Time[1:125]))
fit.t7
## 
## Call:
## arima(x = mlr.t.train7$hospitalizedCurrently, order = c(5, 1, 2), seasonal = list(order = c(1, 
##     0, 0), period = 7), xreg = cbind(mlr.t.train7$positiveIncrease, Time[1:125]))
## 
## Coefficients:
##          ar1      ar2      ar3     ar4     ar5      ma1     ma2    sar1
##       1.5478  -0.9545  -0.0374  0.1514  0.1369  -1.3599  1.0000  0.1687
## s.e.  0.0934   0.1663   0.1879  0.1699  0.0988   0.0401  0.0472  0.1016
##       cbind(mlr.t.train7$positiveIncrease, Time[1:125])1
##                                                   0.0979
## s.e.                                              0.0307
##       cbind(mlr.t.train7$positiveIncrease, Time[1:125])2
##                                                 405.4581
## s.e.                                            456.0831
## 
## sigma^2 estimated as 1182658:  log likelihood = -1045.3,  aic = 2112.59
next.t7 = data.frame(positiveIncrease = mlr.t.test7$positiveIncrease, Time = Time[126:132])
next.t7
##   positiveIncrease Time
## 1            56971  126
## 2            63642  127
## 3            69150  128
## 4            71027  129
## 5            75193  130
## 6            65413  131
## 7            61713  132
mlr.t.7.preds = predict(fit.t7, newxreg = next.t7, n.ahead = 7)
mlr.t.7.preds
## $pred
## Time Series:
## Start = 126 
## End = 132 
## Frequency = 1 
## [1] 57428.02 58546.62 59497.98 60294.34 61128.42 60713.12 60628.56
## 
## $se
## Time Series:
## Start = 126 
## End = 132 
## Frequency = 1 
## [1] 1094.813 1707.677 2395.330 3123.203 3885.663 4701.485 5547.871
ASEmlr7 = mean((mlr.t.test7$hospitalizedCurrently - mlr.t.7.preds$pred)^2)
ASEmlr7
## [1] 1449565
  • 90-Day: 2,867,623,021
  • It would not be a statistically sound decision to compute an ASE for a 90-day forecast: the model would be trained on 20% of the data and tested on 80%, the opposite of a sound time series split. For our COVID severity predictions we advise producing forecasts and ASEs only at the 7-day horizon, but we have computed the 90-day ASE for reference below.
mlr.t.train90 <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[1:42], positiveIncrease = us$positiveIncrease[1:42])
mlr.t.test90 <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[43:132], positiveIncrease = us$positiveIncrease[43:132])

fit.t90 = arima(mlr.t.train90$hospitalizedCurrently, order = c(5,1,2), seasonal = list(order = c(1,0,0), period = 7), xreg = cbind(mlr.t.train90$positiveIncrease, Time[1:42]))
fit.t90
## 
## Call:
## arima(x = mlr.t.train90$hospitalizedCurrently, order = c(5, 1, 2), seasonal = list(order = c(1, 
##     0, 0), period = 7), xreg = cbind(mlr.t.train90$positiveIncrease, Time[1:42]))
## 
## Coefficients:
##           ar1     ar2     ar3      ar4      ar5     ma1     ma2    sar1
##       -0.7190  0.5102  0.8233  -0.0017  -0.3785  1.0177  0.4052  0.6661
## s.e.   0.3626  0.4235  0.2419   0.3069   0.1977  0.3973  0.5122  0.1392
##       cbind(mlr.t.train90$positiveIncrease, Time[1:42])1
##                                                   0.1229
## s.e.                                              0.0607
##       cbind(mlr.t.train90$positiveIncrease, Time[1:42])2
##                                                 841.1172
## s.e.                                           1396.5187
## 
## sigma^2 estimated as 1548026:  log likelihood = -352.5,  aic = 727
next.t90 = data.frame(positiveIncrease = mlr.t.test90$positiveIncrease, Time = Time[43:132])
invisible(next.t90)

mlr.t.90.preds = predict(fit.t90, newxreg = next.t90, n.ahead = 90)
invisible(mlr.t.90.preds)

ASEmlr90 = mean((mlr.t.test90$hospitalizedCurrently - mlr.t.90.preds$pred)^2)
ASEmlr90
## [1] 2867623021

Forecast Dependent Variable: MLR w/ Correlated Errors (lagged variables) for Currently Hospitalized Patients w/ Positive Increase & Trend

With a quick check, we can see that there is no lag correlation between the increase in COVID cases and currently hospitalized patients, contrary to what we theorized. So we will not model an MLR w/ correlated errors on lagged variables.

ccf(us$positiveIncrease,us$hospitalizedCurrently)

Multivariate VAR Model

The total positive COVID cases series requires an additional transformation beyond differencing and the seasonal component to become stationary, so for the VAR model we will use only the positive increase in cases to predict currently hospitalized cases.

  1. Differenced and Seasonal Transformation VAR Model
#Positive Cases Transformations
#pos.diff = artrans.wge(us$positive, phi.tr = 1, lag.max = 100)
#pos.diff.2 = artrans.wge(pos.diff, phi.tr = 1, lag.max = 100)
#pos.trans = artrans.wge(pos.diff.2,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)
#Increase Positive Transformation
inpos.diff = artrans.wge(us$positiveIncrease, phi.tr = 1, lag.max = 100)

inpos.trans = artrans.wge(inpos.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)

#Current Hospitalization Transformations
us.diff = artrans.wge(us$hospitalizedCurrently, phi.tr = 1, lag.max = 100)

currhosp.trans = artrans.wge(us.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)

  1. Use VARselect to find the best model and fit it: AIC is lowest at p = 7, while the more conservative SC (BIC) selects p = 2
#VARselect to choose lag
VARselect(cbind(currhosp.trans, inpos.trans), lag.max = 10, type= "both")
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      7      7      2      7 
## 
## $criteria
##                   1            2            3            4            5
## AIC(n) 3.093857e+01 3.083329e+01 3.079091e+01 3.076976e+01 3.069554e+01
## HQ(n)  3.101650e+01 3.095018e+01 3.094676e+01 3.096458e+01 3.092932e+01
## SC(n)  3.113058e+01 3.112131e+01 3.117493e+01 3.124979e+01 3.127158e+01
## FPE(n) 2.731960e+13 2.459304e+13 2.357880e+13 2.309552e+13 2.145770e+13
##                   6            7            8            9           10
## AIC(n) 3.066656e+01 3.051427e+01 3.053424e+01 3.059674e+01 3.059649e+01
## HQ(n)  3.093930e+01 3.082598e+01 3.088492e+01 3.098638e+01 3.102509e+01
## SC(n)  3.133860e+01 3.128233e+01 3.139830e+01 3.155681e+01 3.165257e+01
## FPE(n) 2.086403e+13 1.793906e+13 1.833019e+13 1.955149e+13 1.959499e+13
#fit model with the parsimonious SC (BIC) lag choice
usdiff.var.fit = VAR(cbind(currhosp.trans, inpos.trans), type = "both", p = 2)
#lowest AIC (p = 7): 30.51427
  1. Predictions for Difference
  • 7 Day
#7 day predictions
pred.var7 = predict(usdiff.var.fit, n.ahead = 7)
pred.var7$fcst$currhosp.trans[,1:3]
##            fcst     lower    upper
## [1,]  -42.12350 -2942.410 2858.163
## [2,] -385.17371 -3346.493 2576.145
## [3,]  -80.65766 -3262.529 3101.214
## [4,] -124.14161 -3327.324 3079.041
## [5,]  -46.62794 -3284.538 3191.282
## [6,]  -43.58380 -3288.207 3201.039
## [7,]  -11.23201 -3262.277 3239.813
  • 90 Day
#90 day predictions
pred.var90 = predict(usdiff.var.fit, n.ahead = 90)
invisible(pred.var90$fcst$currhosp.trans)
  1. Calculate Actual Forecasts from Predicted Differences
  • 7 Day
startingPoints7 = us$hospitalizedCurrently[126:132]
currHospForecasts7 = pred.var7$fcst$currhosp.trans[,1:3] + startingPoints7
  • 90 Day
startingPoints90 = us$hospitalizedCurrently[43:132]
currHospForecasts90 = pred.var90$fcst$currhosp.trans[,1:3] + startingPoints90
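As a reference point for this back-transformation step: the standard way to undo a first difference is to cumulatively sum the predicted differences starting from the last observed level (the seasonal lag-7 difference would be inverted analogously). A self-contained sketch with illustrative names, not taken from the scripts above:

```r
# Invert a first difference: each forecast level is the last observed
# value plus the running sum of the predicted differences.
undifference <- function(lastObserved, diffPreds) {
  lastObserved + cumsum(diffPreds)
}

undifference(100, c(2, 3, -1))  # 102 105 104
```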
  1. Plotting Forecasts
  • 7 Day
#7 day Forecasts
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l", xlim = c(0,139), ylim = c(0,62000), ylab = "Currently Hospitalized COVID Patients", main = "7 Day Currently Hospitalized Patients Forecast")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$fcst, type = "l", col = "red")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$upper, type = "l", col = "blue")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$lower, type = "l", col = "blue")

  • 90 Day

#90 day Forecasts
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l", xlim = c(0,222), ylim = c(0,62000), ylab = "Currently Hospitalized COVID Patients", main = "90 Day Currently Hospitalized Patients Forecast")
lines(seq(133,222,1), as.data.frame(currHospForecasts90)$fcst, type = "l", col = "red")
lines(seq(133,222,1), as.data.frame(currHospForecasts90)$upper, type = "l", col = "blue")
lines(seq(133,222,1), as.data.frame(currHospForecasts90)$lower, type = "l", col = "blue")

  1. ASE
  • 7-Day
varASE7 = mean((us$hospitalizedCurrently[126:132]-currHospForecasts7[1:7])^2)
varASE7
## [1] 25178.55
  • 90-Day
  • It would not be a statistically sound decision to compute an ASE for a 90-day forecast. The model would be trained on 20% of the data and tested on 80%, the exact opposite of a statistically sound time series setup for prediction. For our COVID severity predictions we advise producing forecasts and ASEs only at the 7-day horizon, but the 90-day ASE is produced below for reference.
varASE90 = mean((us$hospitalizedCurrently[43:132]-currHospForecasts90[1:90])^2)
varASE90
## [1] 10791.59
Multilayer Perceptron (MLP) / Neural Network Model

Having examined additional regressors for all of the models above, we know the best regressor to leverage is the positive increase in COVID cases. This regressor reflects behavior similar to that of current hospitalizations and can be modeled properly against them.

  1. Create Currently Hospitalized ts variable

tUScurrhop = ts(us$hospitalizedCurrently)
  1. Create data frame of regressor: positive increase COVID cases variable
tUSx = data.frame(positiveIncrease = ts(us$positiveIncrease))
  1. Forecast of positive increase of COVID cases
  • 7 Day Forecast for Regressor
fit.mlp.new = mlp(ts(tUSx), reps = 50, comb = "mean")
mlp.new.fore7 = forecast(fit.mlp.new, h = 7)
invisible(mlp.new.fore7)
  • 90 Day Forecast for Regressor
mlp.new.fore90 = forecast(fit.mlp.new, h = 90)
invisible(mlp.new.fore90)
  1. Combine observed new cases + forecast new cases
  • 7 Day regressor var
new.regressor7 <- data.frame(c(us$positiveIncrease, mlp.new.fore7$mean))
invisible(new.regressor7)
  • 90 day regressor var
new.regressor90 <- data.frame(c(us$positiveIncrease, mlp.new.fore90$mean))
invisible(new.regressor90)
  1. Fit model for currently hospitalized w/ regressor of new cases
mlp.fit1 = mlp(tUScurrhop, xreg = tUSx, outplot = T, comb = "mean")

plot(mlp.fit1)

  1. Forecast w/ known Positive Increase Case Variable
  • Currently Hospitalized 7 Day Forecast w/ Positive Increase Regressor
currhosp.fore7 = forecast(mlp.fit1, h = 7, xreg = new.regressor7)
plot(currhosp.fore7)

  • Currently Hospitalized 90 Day Forecast w/ Positive Increase Regressor

currhosp.fore90 = forecast(mlp.fit1, h = 90, xreg = new.regressor90)
plot(currhosp.fore90)

  1. ASE
  • 7-Day
#ASEmlr7 = mean((us$hospitalizedCurrently[126:132]-currhosp.fore7$mean)^2)
#ASEmlr7

tUScurrhop2 = ts(us$hospitalizedCurrently[1:125])
tUSx2 = data.frame(positiveIncrease = ts(us$positiveIncrease[1:125]))
fit.mlp.new2 = mlp(ts(tUSx2), reps = 50, comb = "mean")
mlp.new.fore7.2 = forecast(fit.mlp.new2, h = 7)
invisible(mlp.new.fore7.2)
new.regressor7.2 <- data.frame(c(us$positiveIncrease[1:125], mlp.new.fore7.2$mean))
invisible(new.regressor7.2)
mlp.fit1.2 = mlp(tUScurrhop2, xreg = tUSx2, comb = "mean")
#plot(mlp.fit1.2)
currhosp.fore7.2 = forecast(mlp.fit1.2, h = 7, xreg = new.regressor7.2)
#plot(currhosp.fore7.2)
ASEmlr7.2 = mean((us$hospitalizedCurrently[126:132]-currhosp.fore7.2$mean)^2)
ASEmlr7.2
## [1] 721093.4
  • 90-Day
  • It would not be a statistically sound decision to compute an ASE for a 90-day forecast. The model would be trained on 20% of the data and tested on 80%, the exact opposite of a statistically sound time series setup for prediction. For our COVID severity predictions we advise producing forecasts and ASEs only at the 7-day horizon, but the 90-day ASE is produced below for reference.
#ASEmlr90 = mean((us$hospitalizedCurrently[43:132]-currhosp.fore90$mean)^2)
#ASEmlr90

tUScurrhop3 = ts(us$hospitalizedCurrently[1:42])
tUSx3 = data.frame(positiveIncrease = ts(us$positiveIncrease[1:42]))
fit.mlp.new3 = mlp(ts(tUSx3), reps = 50, comb = "mean")
mlp.new.fore7.3 = forecast(fit.mlp.new3, h = 90)
invisible(mlp.new.fore7.3)
new.regressor7.3 <- data.frame(c(us$positiveIncrease[1:42], mlp.new.fore7.3$mean))
invisible(new.regressor7.3)
mlp.fit1.3 = mlp(tUScurrhop3, xreg = tUSx3, comb = "mean")
#plot(mlp.fit1.3)
currhosp.fore7.3 = forecast(mlp.fit1.3, h = 90, xreg = new.regressor7.3)
#plot(currhosp.fore7.3)
ASEmlr7.3 = mean((us$hospitalizedCurrently[43:132]-currhosp.fore7.3$mean)^2)
ASEmlr7.3
## [1] 4562909805

MLP NN Model Analysis

We completed a default neural network model. With so many opportunities to tune a neural network, we knew this would not be our best model in this case. We therefore moved forward with a hyper-tuned neural network model, which allows us to calculate many windowed ASEs and compare the candidate models against each other.

Ensemble Model: Hyper Tuned Neural Network Model

We leveraged the tswgewrapped package to produce a hyperparameter-tuned NN model for our ensemble. This model serves as our higher-functioning ensemble model.

#if (!require(devtools)) {
#  install.packages("devtools")
#}
#devtools::install_github("josephsdavid/tswgewrapped", build_vignettes = TRUE)
library(tswgewrapped)
  1. Train/ Test Data Sets
set.seed(3)
data_train.m <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[1:122], positiveIncrease = us$positiveIncrease[1:122])
data_test.m <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[123:132], positiveIncrease = us$positiveIncrease[123:132])
  1. Hyper tune parameters
# search for best NN hyperparameters in given grid
model.m = tswgewrapped::ModelBuildNNforCaret$new(data = data_train.m, var_interest = "hospitalizedCurrently",
                                               search = 'random', tuneLength = 5, parallel = TRUE,
                                               batch_size = 50, h = 7, m = 7,
                                               verbose = 1)
## Aggregating results
## Selecting tuning parameters
## Fitting reps = 10, hd = 2, allow.det.season = TRUE on full training set
## - Total Time for training: : 41.571 sec elapsed
  1. The windowed ASEs associated with the grid of hyperparameters are shown in the table and heatmap below.
res.m <- model.m$summarize_hyperparam_results()
res.m
##   reps hd allow.det.season     RMSE      ASE   RMSESD    ASESD
## 1   10  2             TRUE 3103.429 13970241 2184.689 16296534
## 2   11  3            FALSE 3129.341 14942705 2380.109 19673017
## 3   16  1            FALSE 3559.638 16480775 2047.127 15988977
## 4   16  3             TRUE 3204.752 15391189 2373.358 19509361
## 5   22  3            FALSE 3268.191 16196849 2463.200 19979406
model.m$plot_hyperparam_results()

  1. Best parameters are shown in the table below. The best hyperparameters from this grid search are 10 repetitions, 2 hidden nodes, and allow.det.season = TRUE.
best.m <- model.m$summarize_best_hyperparams()
best.m
##   reps hd allow.det.season
## 1   10  2             TRUE
  1. Windowed ASE of 13,970,241.
final.ase.m <- dplyr::filter(res.m, reps == best.m$reps &
                    hd == best.m$hd &
                    allow.det.season == best.m$allow.det.season)[['ASE']]
final.ase.m
## [1] 13970241
  1. Ensemble model characteristics and plot
# Final Model
caret_model.m = model.m$get_final_models(subset = 'a')
caret_model.m$finalModel
## MLP fit with 2 hidden nodes and 10 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## 1 regressor included.
## - Regressor 1 lags: (1,3)
## Forecast combined using the median operator.
## MSE: 1065689.5459.
# Plot Final Model
plot(caret_model.m$finalModel)

  1. Forecasts
  • Final model with the best hyperparameters from above and the positive increase of COVID cases as the regressor.
#Ensemble model
ensemble.mlp = mlp(tUScurrhop, xreg = tUSx, outplot = T, reps = 10, hd = 2, allow.det.season = T)

ensemble.mlp
## MLP fit with 2 hidden nodes and 10 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## 1 regressor included.
## - Regressor 1 lags: (1,3)
## Forecast combined using the median operator.
## MSE: 1025155.9706.
#Plot ensemble model
plot(ensemble.mlp)

  • 7 Day
fore7.m = forecast(ensemble.mlp, xreg = new.regressor7, h=7)
#grabbing prediction intervals for 7 day forecast
all.mean.m7 <- data.frame(fore7.m$all.mean)
ranges.m7 <- data.frame(apply(all.mean.m7, MARGIN = 1, FUN = range))
subtracts.m7 <- ranges.m7 - as.list(ranges.m7[1,])
nintyperc.m7 <- data.frame(mapply(`*`,subtracts.m7,.9,SIMPLIFY=FALSE))
diffs.m7 <- data.frame(mapply(`/`,nintyperc.m7,2,SIMPLIFY = FALSE))
diffs.m7 = diffs.m7[-1,]
vector.m7 <-  as.numeric(diffs.m7[1,])

plot(fore7.m)
lines(seq(133,139,1), (fore7.m$mean + vector.m7), type = "l", col = "red")
lines(seq(133,139,1), (fore7.m$mean - vector.m7), type = "l", col = "red")
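As a hedged alternative to scaling the range of the ensemble, the per-repetition paths stored in all.mean can be turned into an empirical 90% band with per-horizon quantiles. This is a sketch of an assumed approach, not part of the original analysis; it assumes fore7.m$all.mean is a matrix with one row per horizon and one column per mlp repetition, as returned by nnfor's forecast method:

```r
# Sketch: empirical 90% interval from the ensemble forecast paths.
paths   <- as.matrix(fore7.m$all.mean)
lower90 <- apply(paths, 1, quantile, probs = 0.05)  # 5th percentile per horizon
upper90 <- apply(paths, 1, quantile, probs = 0.95)  # 95th percentile per horizon
lines(seq(133, 139, 1), upper90, type = "l", col = "red")
lines(seq(133, 139, 1), lower90, type = "l", col = "red")
```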

  • 90 Day
fore90.m = forecast(ensemble.mlp, xreg = new.regressor90, h=90)
#grabbing prediction intervals for 90 day forecast
all.mean.m90 <- data.frame(fore90.m$all.mean)
ranges.m90 <- data.frame(apply(all.mean.m90, MARGIN = 1, FUN = range))
subtracts.m90 <- ranges.m90 - as.list(ranges.m90[1,])
nintyperc.m90 <- data.frame(mapply(`*`,subtracts.m90,.9,SIMPLIFY=FALSE))
diffs.m90 <- data.frame(mapply(`/`,nintyperc.m90,2,SIMPLIFY = FALSE))
diffs.m90 = diffs.m90[-1,]
vector.m90 <-  as.numeric(diffs.m90[1,])

plot(fore90.m)
lines(seq(133,222,1), (fore90.m$mean + vector.m90), type = "l", col = "red")
lines(seq(133,222,1), (fore90.m$mean - vector.m90), type = "l", col = "red")

  1. ASE
  • 7-Day
ASEmlr7 = mean((us$hospitalizedCurrently[126:132]-fore7.m$mean)^2)
ASEmlr7
## [1] 2174937
  • 90-Day
  • It would not be a statistically sound decision to compute an ASE for a 90-day forecast. The model would be trained on 20% of the data and tested on 80%, the exact opposite of a statistically sound time series setup for prediction. For our COVID severity predictions we advise producing forecasts and ASEs only at the 7-day horizon. This point is borne out by the extremely high ASE calculated below.
ASEmlr90 = mean((us$hospitalizedCurrently[43:132]-fore90.m$mean)^2)
ASEmlr90
## [1] 2239813584

Multivariate Model Analysis

We started our multivariate analysis with multiple regression with correlated errors, producing two models: one with a trend and one without. We predicted that trend (time) would decide which model outperformed the other, and we expected the trend model to win. However, when we compared the two by AIC and by ASE, the MLR model without trend performed better on both measures. Applying our domain knowledge, we concluded that time would not necessarily be a strong determinant of COVID severity, since we have yet to observe, and subsequently analyze, a full cycle of the virus's effect on the US population. Once we can analyze data at that level, we would expect a time correlation, and therefore a lag/trend model, to perform better in predicting the virus's severity and hospitalized patients.

Next we completed a vector AR (VAR) analysis of our data. This modeling process requires both the independent and dependent variables to be stationary, which means we are actually modeling the differenced data. The predictions are therefore forecasts of differences, and the level forecasts have to be reconstructed from previous periods' values. This technique is highly sensitive to previously observed values, which in the case of COVID is essential for understanding its severity: since we have yet to observe a full cycle, predictions of hospitalized cases should be based closely on what we have previously observed. So far this is the best of the three models we have produced.

For our final two models we fit multilayer perceptron (neural network) models. The first used mostly default hyperparameters. Its cross-validated ASEs are much higher than the VAR model's; the 90-day forecast ASE is in the billions. This is expected, since the NN model creates multiple scenarios and averages them to forecast forward, so even the highest and lowest paths are incorporated into the forecast. While some scenarios are not statistically likely, the NN model attempts to create a forecast that takes these possibilities into account. To build a better neural network we created a hyperparameter-tuned model, which produced much lower ASEs than the default model. Its forecasts are slightly less dispersed than the default NN model's, telling us these predictions are more useful. Tuning the additional parameters brings the model closer to predicting future hospitalized cases. We performed windowed ASEs in order to pick the best hyperparameters for our advanced neural network model.

Upon reviewing our 90-day forecasts and ASEs, we have concluded it would not be statistically sound to run a 90-day forecast and let it drive long-term decisions until we have more data. A 90-day holdout means training the model on 20% of the data and testing it on 80%, the exact opposite of a statistically sound time series setup for prediction. For our COVID severity predictions we advise producing forecasts and ASEs only at the 7-day horizon, though the 90-day ASEs are produced for reference. This applies to all of our models' cross-validation efforts and ASE calculations.

We have chosen two models to help us predict severity via COVID hospitalizations. We will leverage the vector AR model for short-term forecasts. This model lets us base predictions on previously observed values. Since COVID has not completed a full cycle and the realization shows heavy wandering behavior, we feel this method gets us closest for 7-day forecasts and short-term planning of hospital supplies and staffing, with prediction intervals to help us account for possible peaks and valleys in our forecast.

For long-term forecasting we have chosen our hyperparameter-tuned neural network model. It is helpful for long-term forecasting because it can adjust and refit the most effective model as new data arrive each day. We calculated 90% confidence intervals based on the range of the forecast possibilities. We will continue to compute all of the probable outcomes and produce a mean forecast on which to base all planning. The hyperparameter-tuned neural network model accounts for multiple possibilities and therefore gives us the most statistically useful forecast for the 90-day period.

It is essential that both of these models be updated daily. COVID cases and hospitalizations change frequently, and the only way to keep forecasting and preparing as effectively as possible is to keep the models updated with as much historical data as possible. As we move forward with presenting the models we have decided on, it is important to remember that they merely appear to be the most useful models; while we can work from them, none of them will be exactly correct.

#VAR 7-Day Forecasts w/ Prediction Intervals
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l", xlim = c(0,139), ylim = c(0,62000), ylab = "Currently Hospitalized COVID Patients", main = "7 Day Currently Hospitalized Patients Forecast")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$fcst, type = "l", col = "red")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$upper, type = "l", col = "blue")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$lower, type = "l", col = "blue")

#Hyper Tuned Parameter Neural Network Model Forecast w/ Prediction Intervals
plot(fore90.m)
lines(seq(133,222,1), (fore90.m$mean + vector.m90), type = "l", col = "red")
lines(seq(133,222,1), (fore90.m$mean - vector.m90), type = "l", col = "red")

Multivariate Model Performance Breakdown

Multiple Regression w/ Correlated Errors w/o Trend

  • AIC: 2184.712
  • 7-Day ASE
    • 631,469.8
  • 90-Day ASE
    • 567,209,810

Multiple Regression w/ Correlated Errors w/ Trend

  • AIC: 2186.678
  • 7-Day ASE
    • 1,449,565
  • 90-Day ASE
    • 2,867,623,021

Vector AR Model

  • 7-Day ASE
    • 25,178.55
  • 90-Day ASE
    • 10,791.59

Multilayered Perceptron / Neural Network Model

  • 7-Day ASE
    • 4,068,222
  • 90-Day ASE
    • 17,516,230,847

Ensemble Model: Hyper Tuned Neural Network Model

  • 7-Day ASE
    • 2,126,994
  • 90-Day ASE
    • 2,239,014,808
  • 7-Day Windowed ASE
    • 13,970,241

California COVID Data

EDA: California

CA <- read.csv("https://raw.githubusercontent.com/JaclynCoate/6373_Time_Series/master/TermProject/Data/CA_COVID_7.16.20.csv", header = T)
#Re-format Date
CA$date <- as.Date(CA$date, format = "%m/%d/%Y")
head(CA)
##         date newtested testedtotal newcountconfirmed totalcountconfirmed
## 1 2020-03-18      6291        6291               675                 675
## 2 2020-03-19      5196       11487               331                1006
## 3 2020-03-20      1041       12528               218                1224
## 4 2020-03-21       939       13467               244                1468
## 5 2020-03-22       850       14317               265                1733
## 6 2020-03-23      1237       15554               222                1955
##   newpospercent pospercent_14dayavg newcountdeaths totalcountdeaths
## 1          10.7                10.7             15               15
## 2           6.4                 8.8              3               18
## 3          20.9                 9.8              4               22
## 4          26.0                10.9              4               26
## 5          31.2                12.1              9               35
## 6          17.9                12.6              2               37
##   hospitalized_covid_confirmed_patients
## 1                                     0
## 2                                     0
## 3                                     0
## 4                                     0
## 5                                     0
## 6                                     0
##   hospitalized_suspected_covid_patients hospitalized_covid_patients
## 1                                     0                           0
## 2                                     0                           0
## 3                                     0                           0
## 4                                     0                           0
## 5                                     0                           0
## 6                                     0                           0
##   all_hospital_beds icu_covid_confirmed_patients
## 1                 0                            0
## 2                 0                            0
## 3                 0                            0
## 4                 0                            0
## 5                 0                            0
## 6                 0                            0
##   icu_suspected_covid_patients icu_available_beds
## 1                            0                  0
## 2                            0                  0
## 3                            0                  0
## 4                            0                  0
## 5                            0                  0
## 6                            0                  0

California: Available Variables

  1. Date - Official reporting began 3/18/20. Hospitalization reporting began 3/29/20.
  2. newtested - New tests each day
  3. testedtotal - Cumulative total tests
  4. newcountconfirmed - New positive tests each day
  5. totalcountconfirmed - Cumulative total positive tests
  6. newpospercent - Positive Percent: New daily positive tests divided by new daily tests
  7. pospercent_14dayavg - Rolling average of last 2 weeks of positive percent
  8. newcountdeaths - New deaths each day of confirmed cases
  9. totalcountdeaths - Cumulative total deaths
  10. hospitalized_covid_confirmed_patients - Currently hospitalized patients with positive tests
  11. hospitalized_suspected_covid_patients - Currently hospitalized patients with symptoms but not tested
  12. hospitalized_covid_patients - Hospitalized patients with confirmed + suspected cases
  13. all_hospital_beds - Total available hospital beds
  14. icu_covid_confirmed_patients - Patients with positive tests in intensive care
  15. icu_suspected_covid_patients - Patients with symptoms but not tested in intensive care
  16. icu_available_beds - Total available intensive care unit beds

California: Plots of Cumulative Totals

#Cumulative total Cases
ggplot(data=CA, aes(x=date, y=totalcountconfirmed, group=1)) + 
  geom_line(color="gold") + ggtitle("Cumulative Total COVID-19 Confirmed Cases in CA") + 
  scale_x_date(date_labels = "%b") + xlab("") + ylab("Count")+ theme_fivethirtyeight()

#Cumulative total Tests
ggplot(data=CA, aes(x=date, y=testedtotal, group=1)) + 
  geom_line(color="green2") + ggtitle("Cumulative Total COVID-19 Tests in CA") + 
  scale_x_date(date_labels = "%b") + xlab("") + ylab("Count")+ theme_fivethirtyeight()

#Cumulative total Deaths
ggplot(data=CA, aes(x=date, y=totalcountdeaths, group=1)) + 
  geom_line(color="darkred") + ggtitle("Cumulative Total COVID-19 Related Deaths in CA") + 
  scale_x_date(date_labels = "%b") + xlab("") + ylab("Count")+ theme_fivethirtyeight()

California: Hospitalized Patients

As stated in the analysis of the US data, the number of hospitalized patients is a valuable metric for determining the impact of the virus over time, as it represents the primary driving factor for policy making. The "hospitalized_covid_patients" category was not completely populated for the time frame in which both confirmed and suspected hospitalization data were available, so a new variable is created that calculates total hospitalized COVID patients (confirmed + suspected).

Totalhosp <- (CA$hospitalized_covid_confirmed_patients + CA$hospitalized_suspected_covid_patients)
colors <- c("Confirmed COVID Patients" = "red", "Confirmed + Suspected COVID Patients" = "orange")
ggplot(data=CA) + 
  geom_line(data=CA,aes(x=date, y=hospitalized_covid_confirmed_patients, color="Confirmed COVID Patients")) + 
  geom_line(data=CA,aes(x=date, y=Totalhosp, color="Confirmed + Suspected COVID Patients")) +
  ggtitle("Hospitalized COVID-19 Patients in CA") + 
  scale_x_date(date_labels = "%b") + labs(x="", y="Patients", color = "") +scale_color_manual(values = colors) +
  theme_fivethirtyeight()

Because the availability of tests has increased over time, using the total hospitalized patients (confirmed + suspected) might be the best representation even though there is a possibility that some suspected cases may not actually be COVID-related.

How does hospitalization compare to new cases? New cases might be valuable in predicting hospitalization, or the relationship between them could be informative in terms of the impact of the virus. Based on the plot below, there is an interesting pattern of cases making its way above the hospitalized patients curve, but it is reasonable to assume that current hospitalizations rise as the number of daily new cases rises.

colors <- c("Confirmed COVID Hospital Patients" = "red", "New positive cases" = "orange")
ggplot(data=CA) + 
  geom_line(data=CA,aes(x=date, y=hospitalized_covid_confirmed_patients, color="Confirmed COVID Hospital Patients")) + 
  geom_line(data=CA,aes(x=date, y=newcountconfirmed, color="New positive cases")) +
  ggtitle("Hospitalized COVID-19 Patients vs New Cases in CA") + 
  scale_x_date(date_labels = "%b") + labs(x="", y="Patients", color = "") +scale_color_manual(values = colors) +
  theme_fivethirtyeight()

Univariate AR/ARMA modeling

  1. Original Realization Analysis Traits:
  • dramatic increasing trends early and late in the data
  • very little apparent periodicity
  1. Spectral Density and ACF plots
parzen.wge(Totalhosp)

The spectral density plot shows no particularly revealing peaks at non-zero frequencies. We may later be interested in modeling only the most recent upward trend (late June to present), so we can also check for underlying cycles in that portion of the series by itself.

Totalhosp2 <- Totalhosp[90:119] #This is only the hospitalized patients data from late June to present
parzen.wge(Totalhosp2)

There does not appear to be any cyclic behavior hiding in this time frame either.

acf(Totalhosp,lag.max = 50)

The ACF plot is just a reflection of what the realization showed us: slow damping at the beginning due to the increasing trend, then flattening out for a couple of months before increasing again.

  1. Diagnose Model w/ aic.wge

For a stationary model, we can start by trying the top two recommendations from the aic5.wge function.

aic5.wge(Totalhosp)
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 7     2    0   11.98731
## 6     1    2   11.99646
## 10    3    0   12.00066
## 8     2    1   12.00122
## 5     1    1   12.00350
#aic5.wge(Totalhosp, type = "bic") this also resulted in recommending AR2

The top two models were an AR(2) and an ARMA(1,2).

  1. Estimate phis and thetas
#AR(2)
Hosp_est <- est.ar.wge(Totalhosp, p=2, type = "burg")
## 
## Coefficients of Original polynomial:  
## 1.2538 -0.2885 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9502B              1.0525               0.9502       0.0000
## 1-0.3037B              3.2930               0.3037       0.0000
##   
## 
#(ARMA)1,2
Hosp_estq <- est.arma.wge(Totalhosp, p=1,q=2) #AR component has a root very close to 1
## 
## Coefficients of Original polynomial:  
## 0.9779 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9779B              1.0226               0.9779       0.0000
##   
## 

AIC of two stationary models above

Hosp_est$aic
## [1] 11.97533
Hosp_estq$aic
## [1] 11.99646

The AICs of the AR(2) and ARMA(1,2) models are nearly identical. A rolling-window ASE can provide another comparison of the two models.

  1. Rolling Window ASE evaluation of 2 stationary models

Model 1: AR(2)

trainingSize = 30
horizon = 7
ASEHolder = numeric()
for( i in 1:(119-(trainingSize + horizon) + 1))
{
  
  forecasts = fore.aruma.wge(Totalhosp[i:(i+(trainingSize-1))],phi = Hosp_est$phi, n.ahead = horizon)
  
  ASE = mean((Totalhosp[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
         
  ASEHolder[i] = ASE
}
ASEHolder
hist(ASEHolder)
WindowedASE = mean(ASEHolder)
summary(ASEHolder)
WindowedASE
## [1] 228959.4

Model 2: ARMA(1,2)

trainingSize = 30
horizon = 7
ASEHolder = numeric()
for( i in 1:(119-(trainingSize + horizon) + 1))
{
  
  forecasts = fore.aruma.wge(Totalhosp[i:(i+(trainingSize-1))],phi = Hosp_estq$phi, theta = Hosp_estq$theta, n.ahead = horizon)
  
  ASE = mean((Totalhosp[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
         
  ASEHolder[i] = ASE
}
ASEHolder
hist(ASEHolder)
WindowedASE = mean(ASEHolder)
summary(ASEHolder)
WindowedASE
## [1] 178260.1

The ARMA(1,2) model has a lower ASE when comparing the two models by the windowed ASE method. The windowed ASE might be a better indicator because the behavior of the data changes over the course of the realization. This model also has some appeal in that it increases for a couple of points before beginning its decline.

Univariate ARMA(1,2) Forecasts

Short Term

#one week forecast of ARMA(1,2) model
fq <- fore.arma.wge(Totalhosp, phi = Hosp_estq$phi, theta = Hosp_estq$theta, n.ahead = 7, lastn = FALSE, limits = FALSE)

Long Term

#3 month Forecast of ARMA(1,2) model
fq <- fore.arma.wge(Totalhosp, phi = Hosp_estq$phi, theta = Hosp_estq$theta, n.ahead = 90, lastn = FALSE, limits = FALSE)

The parameters of the ARMA(1,2) model:

Hosp_estq$phi #phi coefficients
## [1] 0.9779335
Hosp_estq$theta #theta coefficients
## [1] -0.2919636 -0.1422741
Hosp_estq$avar #white noise variance estimate
## [1] 151635.6
mean(Totalhosp) #mean
## [1] 4722.504

What this means to us: stationary models like these are useful only if we believe the count is going to (almost) immediately begin declining toward the mean of the data we have (the ARMA(1,2) model has a slight increase first). Whether an imminent decline is likely is anyone's guess, but based on the apparent short-term trend, we might not consider that realistic for a 7-day forecast. What we do know is that the count has to come down at some point, and for a 90-day forecast this model offers a guess that incorporates our assumption of an eventual decrease. It is not unreasonable to assume that the number of hospitalizations will make its way down to the average of the previous few months before eventually falling further.
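One way to put a rough number on "at some point": with an AR factor near one, the gap between the forecast and the series mean shrinks by roughly that factor each step, so a back-of-the-envelope half-life of the reversion can be computed directly from the estimate stored in Hosp_estq (a sketch, not part of the original analysis):

```r
# Sketch: approximate mean-reversion half-life implied by the dominant AR factor.
phi <- 0.9779                     # estimated AR coefficient (Hosp_estq$phi)
half.life <- log(0.5) / log(phi)  # steps needed to close half the gap to the mean
half.life                         # about 31 days
```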

California: Univariate MLP / Neural Network Model

Using an mlp to model hospitalizations could be useful for creating a forecast that is not based on our perceptions of the data formed through the EDA. It might reveal behavioral possibilities we did not consider. The downside is that the mlp will not be "aware" of our expectations for the future, such as an inevitable decline. We can start with a univariate model with default mlp function hyperparameters on the whole dataset and form an ensemble of the mean of 50 repetitions.

  1. Create time series objects
Totalhosp_ts <- ts(Totalhosp) #create time series object
  1. Fit the mlp model
fit_mlp_uni = mlp(Totalhosp_ts, reps = 50, comb = "mean") #mlp fit
fit_mlp_uni
## MLP fit with 5 hidden nodes and 50 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1)
## Forecast combined using the mean operator.
## MSE: 155666.1615.
plot(fit_mlp_uni) #model representation

Short Term Forecast

fore_mlp_uni7 <- forecast(fit_mlp_uni, h=7) #univariate 1-week mlp forecast
plot(fore_mlp_uni7)

Somewhat unsurprisingly due to the fairly constant trend with little noise at the end of the data, the mlp formed models that were essentially linear.

Long Term forecast

fore_mlp_uni90 <- forecast(fit_mlp_uni, h=90) #univariate 3 month mlp forecast
plot(fore_mlp_uni90)

The 90 day forecast shows that there is no representation of our expected decrease in hospitalizations, so this model is likely only useful for a short term forecast.

Tuning Hyperparameters for optimized MLP

Since all of the models used to form the ensemble appeared linear, there is little reason to believe that changing the hyperparameters will have much, if any, effect. However, we attempted to see whether any models could be generated that declined back toward the mean; the mlp did not accomplish this.

library(tswgewrapped)
data_train.u <- data.frame(TotalHosp_wase = Totalhosp[1:99], positiveIncrease = rnorm(99, 0, .0001))
data_test.u <- data.frame(TotalHosp_wase = Totalhosp[100:119], positiveIncrease = rnorm(20, 0, .0001))
# search for best NN hyperparameters in given grid
set.seed(1234)
model.u = tswgewrapped::ModelBuildNNforCaret$new(data = data_train.u, var_interest = "TotalHosp_wase", search = 'random', tuneLength = 5, parallel = TRUE, batch_size = 50, h = 7, verbose = 1)
## Aggregating results
## Selecting tuning parameters
## Fitting reps = 16, hd = 1, allow.det.season = FALSE on full training set
## - Total Time for training: : 23.955 sec elapsed
  • The ASEs associated with the grid of hyperparameters are shown in the table below.
res.u <- model.u$summarize_hyperparam_results()
res.u
##   reps hd allow.det.season     RMSE      ASE   RMSESD    ASESD
## 1   10  2             TRUE 283.6218 115188.8 199.2771 159657.8
## 2   11  3            FALSE 281.9886 111393.4 190.8652 145727.3
## 3   16  1            FALSE 279.7952 108781.1 186.6877 142143.2
## 4   16  3             TRUE 282.1774 112411.7 193.5756 150470.4
## 5   22  3            FALSE 279.5324 109693.8 189.9035 144260.2
  • Windowed ASE: The best hyperparameters:
best.u <- model.u$summarize_best_hyperparams()
best.u
##   reps hd allow.det.season
## 3   16  1            FALSE

The ASE of the model using these hyperparameters is shown below:

final.ase.u <- dplyr::filter(res.u, reps == best.u$reps &
                    hd == best.u$hd &
                    allow.det.season == best.u$allow.det.season)[['ASE']]
final.ase.u
## [1] 108781.1
# Final Model
caret_model.u = model.u$get_final_models(subset = 'a')
caret_model.u$finalModel
## MLP fit with 1 hidden node and 16 repetitions.
## Univariate lags: (1,2)
## Forecast combined using the median operator.
## MSE: 170416.9004.
  • The final mlp model based on the tuned hyperparameters:
#Plot Final Model
plot(caret_model.u$finalModel)

mlp_uni2_90 <- mlp(Totalhosp_ts, reps = 16, hd = 1, allow.det.season = FALSE)
fore_mlp_uni2_7 <- forecast(mlp_uni2_90 , h=7) #univariate 7 day mlp forecast
plot(fore_mlp_uni2_7)

After using the tool to tune the hyperparameters, the forecast produced by the model was nearly the same in appearance and simplicity.

Again, it is essentially impossible to say which model is “better” because our expectations for the future do not necessarily reflect the past behavior of the data. Without some form of biased input, generated models will not take into account what we expect to happen in the long term. For a short term model, the mlp had the lowest ASE, so we could call it the “best” by that metric.
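For reference, the ASE used for these comparisons is simply the mean squared error of a forecast over a hold-out window; a minimal sketch (hypothetical helper name, not from the project code):

```r
# Hypothetical helper: mean squared error of a forecast against held-out values
ase <- function(actual, forecast) {
  stopifnot(length(actual) == length(forecast))
  mean((actual - forecast)^2)
}

ase(c(10, 12, 11), rep(10, 3))  # (0 + 4 + 1) / 3
```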

California: Multiple Linear Regression with Correlated Errors

  1. Create a new variable - 7 Day Average of New Positive Cases

Since we are predicting Hospitalizations, which we determined has no notable seasonality in the CA data, the number of new cases could be transformed into a more useful predictor by calculating the average of the past 7 days.
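For illustration, the same right-aligned 7-day mean can be computed with base R’s stats::filter (sides = 1 averages the current day with the six preceding days; the first six entries are NA because a full window isn’t yet available):

```r
x <- 1:10  # stand-in daily counts, not the CA data
# right-aligned 7-day moving average: current day plus the 6 preceding days
avg7 <- stats::filter(x, rep(1/7, 7), sides = 1)
avg7[7]  # mean(1:7) = 4
```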

newcases_7dayavg <- zoo::rollmean(CA$newcountconfirmed, k=7, align = "right")
CA$newcases_7dayavg <- c(CA$newcountconfirmed[1:6],newcases_7dayavg)
#Replot - Avg of last 7 days New Confirmed Cases
ggplot(data=CA, aes(x=date, y=newcases_7dayavg, group=1)) + 
  geom_line(color="gold2") + ggtitle("Last 7 Day Avg Confirmed COVID-19 Cases in CA") + 
  scale_x_date(date_labels = "%b") + xlab("") + ylab("Count") + theme_fivethirtyeight()

colors <- c("Confirmed COVID Hospital Patients" = "red", "New positive cases" = "orange")
ggplot(data=CA) + 
  geom_line(data=CA,aes(x=date, y=hospitalized_covid_confirmed_patients, color="Confirmed COVID Hospital Patients")) + 
  geom_line(data=CA,aes(x=date, y=newcases_7dayavg, color="New positive cases")) +
  ggtitle("Hospitalized COVID-19 Patients vs 7 Day Avg New Cases in CA") + 
  scale_x_date(date_labels = "%b") + labs(x="", y="Patients", color = "") +scale_color_manual(values = colors) +
  theme_fivethirtyeight()

  1. Forecast New Positive Cases

The MLR with correlated errors model is calculated using the smoothed version of the new cases data. We first need to forecast new cases so that we can include them as a predictor in the hospitalizations forecast.

#Cut out the zeros at the beginning of Totalhosp and create equal length variable for new cases that lines up with hospital data
Totalhosp1 <- Totalhosp[12:119]
newcases_7dayavg1 <- CA$newcases_7dayavg[5:112]

New cases show a similar trend to hospitalizations in the last month. To maintain consistency, we’ll use only the last month of data, as was done in the non-stationary modeling of hospitalizations, and create a model that continues the increasing trend.

#Use only approx last month of new case data
newcases_7dayavg2 <- newcases_7dayavg[83:108]
#Model New Cases
parzen.wge(newcases_7dayavg2) #Data is smoothed by averaging to remove seasonality - parzen plot is more evidence of successful transformation

## $freq
##  [1] 0.03846154 0.07692308 0.11538462 0.15384615 0.19230769 0.23076923
##  [7] 0.26923077 0.30769231 0.34615385 0.38461538 0.42307692 0.46153846
## [13] 0.50000000
## 
## $pzgram
##  [1]   6.8399258   4.5668340   0.7981836  -3.7688205  -6.8129332
##  [6]  -8.0822166  -9.3866117 -10.9297358 -12.2318601 -12.7297715
## [11] -12.6000650 -12.3819517 -12.2935773
#difference the data twice for the increasing linear trend
newcases_7dayavg2_d1 <- artrans.wge(newcases_7dayavg2, phi.tr = 1)

newcases_7dayavg2_d2 <- artrans.wge(newcases_7dayavg2_d1, phi.tr = 1)

#check for white noise
acf(newcases_7dayavg2_d2) #appears to be white noise after second difference
lj10 <- ljung.wge(newcases_7dayavg2_d2, K=10)
## Obs -0.3746032 -0.261266 0.1253135 -0.04341744 0.1014342 0.1564863 -0.3524986 0.2215689 -0.04991159 -0.1646908
lj10$pval
## [1] 0.1234309
lj20 <- ljung.wge(newcases_7dayavg2_d2, K=20)
## Obs -0.3746032 -0.261266 0.1253135 -0.04341744 0.1014342 0.1564863 -0.3524986 0.2215689 -0.04991159 -0.1646908 0.2797824 -0.06330549 -0.1231133 0.02580619 -0.009300202 0.02385994 0.06879511 -0.1053571 0.04606625 -0.02904691
lj20$pval
## [1] 0.3332923
#both ljung-box tests fail to reject H0 of white noise after d=2. No need to fit a stationary component to the model
#The final model is a simple ARIMA(0,0,0) with d=2
f_cases <- fore.aruma.wge(newcases_7dayavg1, d=2, n.ahead = 7, lastn = FALSE, limits = FALSE)

It might not make much sense to include the new cases forecast as a predictor when the model used to produce that forecast is a simple line; that likely defeats the purpose of including a second variable. We could also try an mlp model to see if it comes up with something worth using as an exogenous variable - an ensemble calculated from the mean of generated mlp models may be the best way to produce a short term new cases forecast that isn’t biased by expectations.
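The “simple line” point can be made concrete: an ARIMA(0,0,0) with d = 2 forecasts recursively as 2·x_t − x_{t−1}, so every step just extends the last observed slope. A toy sketch of that recursion (made-up numbers, not the case data):

```r
x <- c(2, 4, 6, 8, 10)  # any series; only the last two values matter here
h <- 7
f <- numeric(h)
last2 <- tail(x, 2)
for (i in seq_len(h)) {
  # d = 2, no ARMA terms: one-step forecast is 2*x_t - x_{t-1}
  f[i] <- 2 * last2[2] - last2[1]
  last2 <- c(last2[2], f[i])
}
f  # 12 14 16 18 20 22 24 - a straight line continuing the last slope
```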

newcases_7dayavg2_ts <- ts(newcases_7dayavg2) #create time series object
fit_mlp_cases = mlp(newcases_7dayavg2_ts, reps = 50, comb = "mean") #mlp fit
fit_mlp_cases
## MLP fit with 5 hidden nodes and 50 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (2)
## Forecast combined using the mean operator.
## MSE: 30129.9382.
fore_mlp_cases <- forecast(fit_mlp_cases, h=7) #univariate 1-week mlp forecast
plot(fore_mlp_cases)

This forecast looks like a good one to use in the hospitalizations forecast.

The next step is to generate the MLR model that predicts hospitalizations.

  1. Check for lag between hospitalizations and new cases
ccf(newcases_7dayavg1, Totalhosp1) #no lagging needed based on ccf plot

  1. Fit a linear model predicting hospitalized patients
mlr_fit <- lm(Totalhosp1~newcases_7dayavg1)
  1. View the residuals of the linear model
plot(mlr_fit$residuals)

  1. Fit the residuals
aic5.wge(mlr_fit$residuals) 
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 5     1    1   10.44648
## 10    3    0   10.45518
## 7     2    0   10.45564
## 8     2    1   10.46386
## 6     1    2   10.46571
#low p/q models as expected. The top pick of ARMA(1,1) should be reasonable
fit1 = arima(Totalhosp1, order=c(1,0,1), xreg=newcases_7dayavg1)
fit1
## 
## Call:
## arima(x = Totalhosp1, order = c(1, 0, 1), xreg = newcases_7dayavg1)
## 
## Coefficients:
##          ar1     ma1  intercept  newcases_7dayavg1
##       0.9804  0.3141   5037.298             0.1824
## s.e.  0.0257  0.0993   1002.218             0.1369
## 
## sigma^2 estimated as 30755:  log likelihood = -713.22,  aic = 1436.43

The AIC of this model is 1436.

  1. Check the residuals of final model
plot(fit1$residuals)

acf(fit1$residuals) 

lj24 <- ljung.wge(fit1$residuals, p=1, q=1)
## Obs -0.1255186 0.005970606 0.04022545 0.01803424 0.05116252 0.09885327 0.1164257 0.1947251 -0.06844523 -0.1223411 0.01329362 0.05266574 -0.03984679 0.1887202 -0.0003049872 0.03184258 -0.08813581 0.06042104 -0.1355143 0.09363089 0.1272651 0.09310402 -0.06917536 -0.02345276
lj24$pval
## [1] 0.2360875
lj48 <- ljung.wge(fit1$residuals, p=1, q=1, K=48)
## Obs -0.1255186 0.005970606 0.04022545 0.01803424 0.05116252 0.09885327 0.1164257 0.1947251 -0.06844523 -0.1223411 0.01329362 0.05266574 -0.03984679 0.1887202 -0.0003049872 0.03184258 -0.08813581 0.06042104 -0.1355143 0.09363089 0.1272651 0.09310402 -0.06917536 -0.02345276 -0.05108337 -0.02155796 0.009357193 0.1167107 -0.004757744 0.004798365 -0.1823881 -0.003576192 -0.04661079 -0.004224437 0.02325559 0.1101438 -0.1428314 0.02463392 -0.08457913 -0.004651226 -0.0328382 0.03049804 0.06123975 0.04165659 -0.1045938 -0.1328497 -0.02459915 -0.03028285
lj48$pval
## [1] 0.380569

The acf does not show significant autocorrelation and the Ljung-Box tests failed to reject H0 that the residuals are white noise.
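As a sanity check, the same Ljung-Box statistic is available in base R via stats::Box.test, with fitdf = p + q adjusting the degrees of freedom; shown here on simulated white noise rather than the actual residuals:

```r
set.seed(42)
e <- rnorm(100)  # stand-in white noise; in the analysis this would be fit1$residuals
bt <- Box.test(e, lag = 24, type = "Ljung-Box", fitdf = 2)  # fitdf = p + q = 2
bt$p.value  # compare against alpha = 0.05; large values fail to reject white noise
```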

Forecast Hospitalizations with MLR w/ Correlated Errors with 7 Day Avg New Cases

next7 = data.frame(new_cases_avg = fore_mlp_cases$mean)
f_mlr <- predict(fit1, newxreg = next7, n.ahead = 7)
plot(seq(1,108,1), Totalhosp1, type = "l",xlim = c(0,115),ylim=c(4000,9000), xlab="days", ylab = "COVID-Related Hospitalized Patients", main = "7 Day Forecast - Linear Regression with Corr Errors Model")
lines(seq(109,115,1), f_mlr$pred, type = "l", col = "red")

Add time as a Variable in MLR

Based on the previous analysis we clearly believe the data is dependent on time (there is a trend), so time should be added to the linear model as a variable. We can compare the AIC to the previous model with time excluded.

#fit linear model for predicting hospitalization, including time as a variable
Time <- seq(1,108,1)
tmlr_fit <- lm(Totalhosp1~newcases_7dayavg1+Time)
plot(tmlr_fit$residuals)

#fit residuals
aic5.wge(tmlr_fit$residuals) 
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 5     1    1   10.65776
## 10    3    0   10.66787
## 6     1    2   10.67013
## 8     2    1   10.67525
## 7     2    0   10.67649
#The top pick is ARMA(1,1) again
fit1t = arima(Totalhosp1, order=c(1,0,1), xreg=cbind(newcases_7dayavg1,Time))
## Warning in arima(Totalhosp1, order = c(1, 0, 1), xreg =
## cbind(newcases_7dayavg1, : possible convergence problem: optim gave code =
## 1
fit1t #aic =1433, slightly better with time included
## 
## Call:
## arima(x = Totalhosp1, order = c(1, 0, 1), xreg = cbind(newcases_7dayavg1, Time))
## 
## Coefficients:
##          ar1     ma1  intercept  newcases_7dayavg1     Time
##       0.9736  0.3171   3556.898             0.0789  31.9194
## s.e.  0.0203  0.0955    946.776             0.1204  16.3045
## 
## sigma^2 estimated as 29525:  log likelihood = -710.87,  aic = 1433.73
#check residuals of model
plot(fit1t$residuals)

acf(fit1t$residuals) #appears to have no autocorrelation

lj24 <- ljung.wge(fit1t$residuals, p=1, q=1)
## Obs -0.03926346 0.02864523 0.02537815 0.04551755 0.03999161 0.09442141 0.1322683 0.1980511 -0.03952732 -0.1478463 -0.01064909 0.04227483 -0.01795223 0.1660975 0.009965204 0.01774608 -0.09759234 0.05039035 -0.1401209 0.06995263 0.1272211 0.1065629 -0.07193778 -0.03632591
lj24$pval
## [1] 0.3109216
lj48 <- ljung.wge(fit1t$residuals, p=1, q=1, K=48)
## Obs -0.03926346 0.02864523 0.02537815 0.04551755 0.03999161 0.09442141 0.1322683 0.1980511 -0.03952732 -0.1478463 -0.01064909 0.04227483 -0.01795223 0.1660975 0.009965204 0.01774608 -0.09759234 0.05039035 -0.1401209 0.06995263 0.1272211 0.1065629 -0.07193778 -0.03632591 -0.05790954 -0.02761053 0.01162714 0.1050361 0.009741847 -0.01373677 -0.1871925 -0.02793601 -0.05181441 -0.01300189 0.006677094 0.1047625 -0.1363873 -0.002084924 -0.09746059 -0.002928432 -0.03892722 0.02326122 0.06703607 0.0168426 -0.09753006 -0.1569328 -0.02946265 -0.0486311
lj48$pval
## [1] 0.3938294
#Ljung-Box test fails to reject H0 - no evidence against white noise
#forecast hospitalizations using prior forecast of 7 day avg new cases by mlp model
next7 = data.frame(new_cases_avg = fore_mlp_cases$mean, Time = seq(109,115,1))
f_mlr2 <- predict(fit1t, newxreg = next7, n.ahead = 7)
plot(seq(1,108,1), Totalhosp1, type = "l",xlim = c(0,115),ylim=c(4000,9000), xlab="days", ylab = "COVID-Related Hospitalized Patients", main = "7 Day Forecast - Linear Regression with Corr Errors Model")
lines(seq(109,115,1), f_mlr2$pred, type = "l", col = "red")
lines(seq(109,115,1), (f_mlr2$pred+f_mlr2$se), type = "l", col = "red",lty = 2)
lines(seq(109,115,1), (f_mlr2$pred-f_mlr2$se), type = "l", col = "red",lty = 2)

The model with time included as a variable in the MLR had a slightly lower AIC, but the forecast is really anyone’s guess. For a seven-day prediction, the second model (with time included in the linear model variables) looks like a reasonable extension of the current trend. Again, this model is not suitable for a 90 day forecast based on our expectations.

The model was run again reserving the last 7 observations for calculating the ASE against a forecast.

Totalhosp1_train <- Totalhosp1[1:101]
Totalhosp1_test <- Totalhosp1[102:108]
  
fit1t_test = arima(Totalhosp1_train, order=c(1,0,1), xreg=cbind(newcases_7dayavg1[1:101],Time[1:101]))
next7_test = data.frame(new_cases_avg = newcases_7dayavg1[102:108], time_t = Time[102:108])
f_mlr2_test <- predict(fit1t_test, newxreg = next7_test, n.ahead = 7)
ASE = mean((Totalhosp1_test - f_mlr2_test$pred)^2)
ASE
## [1] 176602.1

The plot of this overlaid forecast, used to calculate the ASE, shows that this model doesn’t have quite as severe an incline, so it doesn’t depart as far from the dip at the end of the data.

plot(seq(1,108,1), Totalhosp1, type = "l",xlim = c(0,108),ylim=c(4000,9000), xlab="days", ylab = "COVID-Related Hospitalized Patients", main = "7 Day Forecast - Linear Regression with Corr Errors Model")
lines(seq(102,108,1), f_mlr2_test$pred, type = "l", col = "red")

California: Vector AR Models

We can use the same variables to model using VAR.

  1. Create matrix of variables
var_matrix1 <- cbind(newcases_7dayavg1, Totalhosp1)
  1. Model Hospitalizations and New Cases with VAR

We will model the data as if it were stationary; this might help avoid generating models with exponential inclines.

VARselect(var_matrix1, lag.max = 10, type = "both") #AIC picks 9, BIC picks 1
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      9      1      1      8 
## 
## $criteria
##                   1            2            3            4            5
## AIC(n) 1.938112e+01 1.943825e+01 1.935951e+01 1.934788e+01 1.942175e+01
## HQ(n)  1.946647e+01 1.956627e+01 1.953021e+01 1.956126e+01 1.967781e+01
## SC(n)  1.959213e+01 1.975477e+01 1.978155e+01 1.987542e+01 2.005481e+01
## FPE(n) 2.613073e+08 2.767298e+08 2.558848e+08 2.531024e+08 2.727930e+08
##                   6            7            8            9           10
## AIC(n) 1.930603e+01 1.925005e+01 1.911073e+01 1.910819e+01 1.914237e+01
## HQ(n)  1.960477e+01 1.959146e+01 1.949481e+01 1.953495e+01 1.961181e+01
## SC(n)  2.004460e+01 2.009412e+01 2.006031e+01 2.016328e+01 2.030297e+01
## FPE(n) 2.433398e+08 2.305421e+08 2.010691e+08 2.012009e+08 2.090171e+08
vfit1_1 <- VAR(var_matrix1,p=9,type = "both")

Forecast using VAR model

#7 Day forecast
vpreds1_7 <- predict(vfit1_1,n.ahead = 7)
vpreds1_7$fcst$Totalhosp1
##          fcst    lower     upper       CI
## [1,] 8561.152 8328.223  8794.080 232.9283
## [2,] 8512.702 8184.889  8840.514 327.8128
## [3,] 8682.112 8292.366  9071.858 389.7459
## [4,] 8938.472 8504.740  9372.205 433.7322
## [5,] 9202.291 8721.534  9683.048 480.7570
## [6,] 9250.180 8724.203  9776.156 525.9764
## [7,] 9481.318 8887.768 10074.868 593.5497
#Plot 7 day forecast
plot(Time, Totalhosp1, type = "l",xlim = c(0,115),  ylim = c(4000,10000), ylab = "COVID-Related Hospitalized Patients", main = "7 Day Forecast - VAR Model")
lines(seq(109,115,1), vpreds1_7$fcst$Totalhosp1[,1], type = "l", col = "red")

Based on the data we have, the model created an exponential incline; again, we consider a 90 day forecast from this model unrealistic given our expectations.

#90 Day forecast
vpreds1_90 <- predict(vfit1_1,n.ahead = 90)
#Plot 90 day forecast
plot(Time, Totalhosp1, type = "l",xlim = c(0,198),  ylim = c(4000,20000), ylab = "COVID-Related Hospitalized Patients", main = "90 Day Forecast - VAR Model")
lines(seq(109,198,1), vpreds1_90$fcst$Totalhosp1[,1], type = "l", col = "red")

As with the MLR model, we can backtrack and reserve the last 7 points to calculate ASE.

var_matrix1_train <- cbind(newcases_7dayavg1[1:101], Totalhosp1[1:101])
var_matrix1_test <- cbind(newcases_7dayavg1[102:108], Totalhosp1[102:108])
vfit1_train <- VAR(var_matrix1_train,p=9,type = "both")
vpreds1_train <- predict(vfit1_train,n.ahead = 7)
ASE = mean((Totalhosp1_test - vpreds1_train$fcst$y2[,1])^2)
ASE
## [1] 1240595

We can see that the ASE is much higher for this model because it generated a rapidly increasing trend that missed the dip at the end of the data.

plot(Time, Totalhosp1, type = "l",xlim = c(0,108),  ylim = c(4000,10000), ylab = "COVID-Related Hospitalized Patients", main = "7 Day Forecast - VAR Model")
lines(seq(102,108,1), vpreds1_train$fcst$y2[,1], type = "l", col = "red")

California: Multivariate MLP Models

We can start with the base mlp function as we did in the univariate analysis.

#create time series objects
tsth <- ts(Totalhosp1)
tsnc <- ts(newcases_7dayavg1)
#model hospitalizations with mlp
tsnc_df <- data.frame(tsnc)
mlp_fit_mult <- mlp(tsth, reps = 50, comb = "median", xreg = tsnc_df)
plot(mlp_fit_mult)

#add new cases forecast from earlier to use in multivariate mlp forecast
case_f <- as.numeric(fore_mlp_cases$mean)
case_df <- data.frame(c(newcases_7dayavg1, case_f))
#add the new df to the multivariate forecast
mlp_fore_mult <- forecast(mlp_fit_mult, h=7, xreg = case_df)
plot(mlp_fore_mult)

The 7 day forecast does not look unreasonable, but we can tune the hyperparameters to optimize by lowest ASE.

Multivariate MLP with tuned Hyperparameters

As with the univariate MLP, we can tune the hyperparameters using windowed ASE to get an optimized model.

data_train.u <- data.frame(TotalHosp_wase1 = Totalhosp1[1:88], newcases_7dayavg1_wase1 = newcases_7dayavg1[1:88])
data_test.u <- data.frame(TotalHosp_wase1 = Totalhosp1[89:108], newcases_7dayavg1_wase1 = newcases_7dayavg1[89:108])
# search for best NN hyperparameters in given grid
model.u = tswgewrapped::ModelBuildNNforCaret$new(data = data_train.u, var_interest = "TotalHosp_wase1", search = 'random', tuneLength = 5, parallel = TRUE, batch_size = 50, h = 7, verbose = 1)
## Aggregating results
## Selecting tuning parameters
## Fitting reps = 16, hd = 1, allow.det.season = FALSE on full training set
## - Total Time for training: : 20.068 sec elapsed
  • The ASEs associated with the grid of hyperparameters are shown in the table below.
res.u <- model.u$summarize_hyperparam_results()
res.u
##   reps hd allow.det.season     RMSE      ASE   RMSESD    ASESD
## 1   10  2             TRUE 191.6448 39467.11 57.33450 22970.72
## 2   11  3            FALSE 206.1944 46434.87 68.57449 29678.06
## 3   16  1            FALSE 184.3065 36725.75 57.51728 20639.74
## 4   16  3             TRUE 207.7638 47263.78 70.12560 30760.63
## 5   22  3            FALSE 205.9433 46259.84 67.94564 29964.44
  • Windowed ASE: The best hyperparameters:
best.u <- model.u$summarize_best_hyperparams()
best.u
##   reps hd allow.det.season
## 3   16  1            FALSE

The ASE of the model using these hyperparameters is shown below:

final.ase.u <- dplyr::filter(res.u, reps == best.u$reps &
                    hd == best.u$hd &
                    allow.det.season == best.u$allow.det.season)[['ASE']]
final.ase.u
## [1] 36725.75
# Final Model
caret_model.u = model.u$get_final_models(subset = 'a')
caret_model.u$finalModel
## MLP fit with 1 hidden node and 16 repetitions.
## Univariate lags: (1)
## Forecast combined using the median operator.
## MSE: 26618.4708.
  • The final mlp model based on the tuned hyperparameters:
#Plot Final Model
plot(caret_model.u$finalModel)

Nothing too fancy here. Running the model to get forecasts based on the recommended hyperparameters, we use the median combiner since there are so few repetitions and we don’t want the final model to be heavily influenced by outliers.
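The effect of the median combiner can be seen with a toy set of repetition forecasts (made-up numbers): one diverged network shifts the mean substantially but barely moves the median.

```r
set.seed(1)
# 15 repetitions forecasting near 5000 patients, plus one diverged network
rep_forecasts <- c(5000 + rnorm(15, 0, 50), 50000)
mean(rep_forecasts)    # dragged far above 5000 by the single outlier
median(rep_forecasts)  # stays near 5000
```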

mlp_mult2 <- mlp(tsth, reps = 16, hd=1, comb = "median", xreg = tsnc_df, allow.det.season = FALSE)
fore_mlp_mult_7 <- forecast(mlp_mult2, h=7, xreg = case_df) #1 week mlp forecast
plot(fore_mlp_mult_7)

The tuning process produced a model with a more drastic increasing trend than the model using default parameters. Because of the constant increasing trend, we would again only use this model for a 7 day forecast.

Again for consistent comparison, this model was run reserving the last 7 observations to calculate the ASE against a 7 day forecast.

#create time series objects and dataframes for shortened data
tsth_train <- ts(Totalhosp1[1:101])
tsnc_train <- ts(newcases_7dayavg1[1:101])
tsnc_df_t <- data.frame(tsnc_train)
casedf_t <- data.frame(newcases_7dayavg1)
#model hospitalizations with mlp on shortened set
mlp_mult_train <- mlp(tsth_train, reps = 16, hd=1, comb = "median", xreg = tsnc_df_t, allow.det.season = FALSE)
fore_mlp_mult_7_ASE <- forecast(mlp_mult_train, h=7, xreg = casedf_t) #1 week mlp forecast overlay for ASE
#Calculate ASE
ASE = mean((Totalhosp1_test - fore_mlp_mult_7_ASE$mean)^2)
ASE
## [1] 272510.1

We can see again that the dip at the end of the data is causing the differences. The forecast follows the trend at the end of the data.

plot(Time, Totalhosp1, type = "l",xlim = c(0,108),  ylim = c(4000,10000), ylab = "COVID-Related Hospitalized Patients", main = "7 Day Forecast - MLP Model")
lines(seq(102,108,1), fore_mlp_mult_7_ASE$mean, type = "l", col = "red")

California: Quantitative Model Comparison by ASE

The ASE was calculated for each model by comparing 7 day forecasts to the last 7 points of the data. All multivariate models used the 7 Day Average of New Positive Cases as a predictor of Hospitalized Patients. The calculated values for each model’s ASE are as follows:

Univariate Models
  • ARMA(1,2) Stationary Model: 178,260
  • ARIMA(0,0,0) with d=2 (Non-stationary) Model: 379,683
  • Univariate MLP Ensemble (Mean) Model: 92,395
Multivariate Models
  • Multiple Linear Regression with Correlated Errors Model: 176,478
  • Vector AR model: 1,240,595
  • Multivariate MLP Ensemble (Median) Model: 272,510

California: Model Selections and Conclusions on Forecasting Current Hospitalized COVID Patients

Traditional methods for generating time series forecasting models have very limited usefulness in this particular scenario with hospitalized COVID-19 patients in California. Our analysis operated under the assumption that the numbers of cases and hospitalizations, regardless of any current trend, must eventually decline back to (nearly) zero, as is expected of a novel virus. At the current time, hospitalizations are increasing in California, so non-stationary models generated from past data continue this trend. Since we believe that a constantly (or, in the case of some models seen in this analysis, exponentially) increasing trend is unlikely to continue for the next three months, none of these non-stationary models can be used for forecasting more than a few days ahead. We’ve deemed this assumption reasonable based solely on our opinion formed from the past trends we’ve witnessed.

It is for this reason that the best model we have from this analysis for a three month forecast is the stationary ARMA(1,2) model that drives the forecast back toward the mean of the data we have.
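The mean-reverting behavior driving that choice is easiest to see in the AR(1) closed form, where the h-step forecast is x̄ + φ^h (x_n − x̄); the same geometric decay toward the mean carries over to a fitted ARMA(1,2). Illustrative values only, not the fitted parameters:

```r
xbar <- 5000; phi <- 0.9; x_n <- 8000  # made-up mean, AR coefficient, last value
h <- 1:90
fcst <- xbar + phi^h * (x_n - xbar)    # geometric decay back toward the mean
round(fcst[c(1, 7, 90)])  # 7700 6435 5000
```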

For a one week forecast, the univariate MLP had the lowest ASE overall, while the multiple linear regression model using the 7-day average of new positive cases had the lowest ASE of the multivariate models. Because of the volatility of the future behavior of this data, we don’t think the model with the lowest ASE is necessarily the best for forecasting. One advantage of the MLR model is that by using the average of the last 7 days of new cases as a predictor along with the current trend, we get a good blend of weighing future changes in hospitalizations by both past behavior and the number of new cases. We learned from this model that there is a correlation between these two variables, but we could see in their plot that new cases went from being lower on the curve to crossing over and sitting higher than the number of hospitalized COVID patients. This could indicate that the number of new cases may become a poor predictor in the future, and we touched on this as part of our decision not to treat new cases as the best indicator of virus severity. For the next seven days, however, we would like to use the model that includes this predictor, so we’ll go with the MLR model.

California: Final 7 Day and 90 Day Predictions with 95% Confidence Intervals

3 month Forecast with ARMA(1,2) model

fq <- fore.arma.wge(Totalhosp, phi = Hosp_estq$phi, theta = Hosp_estq$theta, n.ahead = 90, lastn = FALSE, limits = TRUE)

7 Day Forecast with MLR Model

plot(seq(1,108,1), Totalhosp1, type = "l",xlim = c(0,115),ylim=c(4000,9000), xlab="days", ylab = "COVID-Related Hospitalized Patients", main = "7 Day Forecast - Linear Regression with Corr Errors Model")
lines(seq(109,115,1), f_mlr2$pred, type = "l", col = "red")
lines(seq(109,115,1), (f_mlr2$pred+f_mlr2$se), type = "l", col = "red",lty = 2)
lines(seq(109,115,1), (f_mlr2$pred-f_mlr2$se), type = "l", col = "red",lty = 2)